Column Descriptions:
id: (Unique id for each patient)
age: (Age of the patient in years)
origin: (place of study)
sex: (Male/Female)
cp: chest pain type ([typical angina, atypical angina, non-anginal, asymptomatic])
trestbps: resting blood pressure (resting blood pressure (in mm Hg on admission to the hospital))
chol: (serum cholesterol in mg/dl)
fbs: (if fasting blood sugar > 120 mg/dl)
restecg: (resting electrocardiographic results) – Values: [normal, stt abnormality, lv hypertrophy]
thalach: maximum heart rate achieved
exang: exercise-induced angina (True/ False)
oldpeak: ST depression induced by exercise relative to rest
slope: the slope of the peak exercise ST segment
ca: number of major vessels (0-3) colored by fluoroscopy
thal: [normal; fixed defect; reversible defect]
num: the predicted attribute
Data Creators: Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D. Relevant Papers: Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64,304–310. Web Link David W. Aha & Dennis Kibler. “Instance-based prediction of heart-disease presence with the Cleveland database.” Web Link Gennari, J.H., Langley, P, & Fisher, D. (1989). Models of incremental concept formation. Artificial Intelligence, 40, 11–61. Web Link
Citation Request: The authors of the databases have requested that any publications resulting from the use of the data include the names of the principal investigator responsible for the data collection at each institution. They would be:
Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation:Robert Detrano, M.D., Ph.D.
Libraries used:
Let’s load the data set
First we will change some variable names in order to facilitate the project’s understanding.
So now our variables will be studied as:
id -> int; Unique id for each patient.
age -> int; Age of the patient in years.
sex -> chr; “Male” or “Female”.
dataset -> chr; place of study (“Cleveland”, “Hungary”, “Switzerland” and “VA Long Beach”).
chest_pain -> chr; chest pain type ([“typical angina”, “atypical angina”, “non-anginal”, asymptomatic”]).
rest_blood_pressure -> int; resting blood pressure (in mm Hg on admission to the hospital).
chol -> int; serum cholesterol in mg/dl.
blood_sugar -> lgl; if fasting blood sugar > 120 mg/dl (True or False).
restecg -> chr; resting electrocardiographic results [“normal”, “stt abnormality”, “lv hypertrophy”].
max_heart_rate -> int; maximum heart rate achieved.
exang -> lgl; exercise-induced angina (True or False).
oldpeak -> dbl; ST depression induced by exercise relative to rest.
slope -> chr; the slope of the peak exercise ST segment (“downsloping”, “flat” and “upsloping”).
ca -> int; number of major vessels (0-3) colored by fluoroscopy.
thal -> chr; [“normal”; “fixed defect” and “reversible defect”].
stage -> int; the predicted attribute
summary(heart)
## id age sex dataset
## Min. : 1.0 Min. :28.00 Length:920 Length:920
## 1st Qu.:230.8 1st Qu.:47.00 Class :character Class :character
## Median :460.5 Median :54.00 Mode :character Mode :character
## Mean :460.5 Mean :53.51
## 3rd Qu.:690.2 3rd Qu.:60.00
## Max. :920.0 Max. :77.00
##
## chest_pain rest_blood_pressure chol blood_sugar
## Length:920 Min. : 0.0 Min. : 0.0 Mode :logical
## Class :character 1st Qu.:120.0 1st Qu.:175.0 FALSE:692
## Mode :character Median :130.0 Median :223.0 TRUE :138
## Mean :132.1 Mean :199.1 NA's :90
## 3rd Qu.:140.0 3rd Qu.:268.0
## Max. :200.0 Max. :603.0
## NA's :59 NA's :30
## restecg max_heart_rate exang oldpeak
## Length:920 Min. : 60.0 Mode :logical Min. :-2.6000
## Class :character 1st Qu.:120.0 FALSE:528 1st Qu.: 0.0000
## Mode :character Median :140.0 TRUE :337 Median : 0.5000
## Mean :137.5 NA's :55 Mean : 0.8788
## 3rd Qu.:157.0 3rd Qu.: 1.5000
## Max. :202.0 Max. : 6.2000
## NA's :55 NA's :62
## slope ca thal stage
## Length:920 Min. :0.0000 Length:920 Min. :0.0000
## Class :character 1st Qu.:0.0000 Class :character 1st Qu.:0.0000
## Mode :character Median :0.0000 Mode :character Median :1.0000
## Mean :0.6764 Mean :0.9957
## 3rd Qu.:1.0000 3rd Qu.:2.0000
## Max. :3.0000 Max. :4.0000
## NA's :611
By taking a first look at the summary of the data set and analysing it deeply, we get some first conclusions on where to head our Preprocessing of the data:
NA values, especially in ca (611); rest_blood_pressure, chol, blood_sugar, max_heart_rate, exang, oldpeak also have missing values in a much lower quantity. Based on this difference of NAs the way of handling, must be thought thorough, and may be differently depending on the variable, row…
The variables id and dataset do not give relevant information. We believe that the number id does not give a big relevance to identify if a patient suffers or not a heart disease.
Before deleting the variable id we must look for duplicated values, in this case a duplicated value for an id number would give us an observation to get rid off. So before deleting it, we will check it.
Encoding, There are a considerable amount of character variables that must be changed into factors so we can get better results and handle better the data (sex, chest_pain, restecg, slope and thal). We will have to take into consideration the labels and levels.
Outliers, by looking at the numerical variables and their minimum, maximum and quantile ranges. We appreciate the reckon of a deep study on those values.
In order to facilitate comparisons between different variables and for the following subjects of matter of this Project it might be a good idea to normalize and standarise numerical variables.
These main ideas came from a look at the data set information, summary and understanding of the data. Nevertheless, more feature engineering strategies may appear while visualizing the data.
So from this first conclusion we get two main problems, handling the importance of the variables (which variables to delete); the type and values of the variables (factorise); and outstanding values such as NAs and outliers.
Let’s start taking care of duplicated variables and non-relevant variables:
heart$id[duplicated(heart$id)]
## integer(0)
# We see that there are no duplicated rows (no repeated people)
# Now, let's look at the variables to delete
# heart2 = heart %>% select(- id) # second assumption
heart = heart %>% select(- id, - dataset) # main assumption
glimpse(heart)
## Rows: 920
## Columns: 14
## $ age <int> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56…
## $ sex <chr> "Male", "Male", "Male", "Male", "Female", "Male", …
## $ chest_pain <chr> "typical angina", "asymptomatic", "asymptomatic", …
## $ rest_blood_pressure <int> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, …
## $ chol <int> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, …
## $ blood_sugar <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ restecg <chr> "lv hypertrophy", "lv hypertrophy", "lv hypertroph…
## $ max_heart_rate <int> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, …
## $ exang <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRU…
## $ oldpeak <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, …
## $ slope <chr> "downsloping", "flat", "flat", "downsloping", "ups…
## $ ca <int> 0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ thal <chr> "fixed defect", "normal", "reversable defect", "no…
## $ stage <int> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1,…
# glimpse(heart2)
We will factorise the variables that are characters. As PCA and Factor Analysis are numerical techniques, we will not use strings as labels for the categorical variables. Therefore, for a better comprehension of the Project, we will state here the corresponding of each factor number to its actual string value. Logical features will not be factorised as they are handled as ones or zeroes when applying PCA and Factor Analysis techniques.
glimpse(heart)
## Rows: 920
## Columns: 14
## $ age <int> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56…
## $ sex <fct> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ chest_pain <fct> 1, 0, 0, 3, 2, 2, 0, 0, 0, 0, 0, 2, 3, 2, 3, 3, 2,…
## $ rest_blood_pressure <int> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, …
## $ chol <int> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, …
## $ blood_sugar <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ restecg <fct> 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0,…
## $ max_heart_rate <int> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, …
## $ exang <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRU…
## $ oldpeak <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, …
## $ slope <fct> 0, 1, 1, 0, 2, 2, 0, 2, 1, 0, 1, 1, 1, 2, 2, 2, 0,…
## $ ca <fct> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1,…
## $ thal <fct> 1, 0, NA, 0, 0, 0, 0, 0, NA, NA, 1, 0, 1, NA, NA, …
## $ stage <fct> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1,…
After this, our final working features will be defined as:
age -> int; Age of the patient in years.
sex -> fct; “Male” or “Female” -> (0, 1)
chest_pain -> fct; chest pain type ([“typical angina”, “atypical angina”, “non-anginal”, “asymptomatic”]) -> (0, 1, 2, 3)
“typical angina” = 0 -> type of chest pain that goes away when resting.
“atypical angina” = 1 -> type of chest pain that does not fit the typical pattern of angina.
“non-anginal” = 2 -> type of chest pain unrelated to heart conditions.
“asymptomatic” = 3 -> absence of chest pain or discomfort.
rest_blood_pressure -> int; resting blood pressure (in mm Hg on admission to the hospital).
chol -> int; serum cholesterol in mg/dl.
blood_sugar -> lgl; if fasting blood sugar > 120 mg/dl (True or False).
If True -> no possibility of having diabetes.
If False -> there are possibilities of having diabetes.
restecg -> fct; resting electrocardiographic results [“normal”, “stt abnormality”, “lv hypertrophy”] -> (0, 1, 2)
“normal” = 0 -> electrical activity of the heart is within the expected range.
“stt abnormality” = 1 -> electrical activity of the heart is showing ST-T abnormalities which can indicate various cardiac conditions.
“lv hypertrophy” = 2 -> electrical activity of the heart is showing Left Ventricular Hypertrophy which suggests that the left ventricle of the heart is thicker or larger than normal. This can be a sign of various heart conditions or high blood pressure.
max_heart_rate -> int; maximum heart rate achieved.
exang -> lgl; exercise-induced angina (True or False).
If True -> individual experienced some type of discomfort during physical activity.
If False -> individual did not experienced any discomfort while exercising.
oldpeak -> dbl; ST depression induced by exercise relative to rest.
slope -> fct; the slope of the peak exercise ST segment (“downsloping”, “flat” and “upsloping”) -> (0, 1, 2)
“downsloping” = 0 -> negative slope during peak exercise, can indicate important heart conditions.
“flat” = 1 -> not showing significant elevation or depression, non-concerning for the patient.
“upsloping” = 2 -> positive slope during peak exercise, can indicate important heart conditions.
ca -> fct; number of major vessels (0-3) colored by fluoroscopy -> (0, 1, 2, 3)
thal -> fct; [“normal”; “fixed defect” and “reversible defect”] -> (0, 1, 2)
stage -> fct; the predicted attribute -> (0,1, 2, 3, 4).
0 -> absence of significant heart conditions.
1 -> early or mild stage of heart conditions, which implies minor heart issues or risk factors.
2 -> moderate level of heart conditions, this stage suggests that the individual’s heart condition is becoming more relevant. Medical intervention may be necessary.
3 -> more advanced or severe heart conditions, which require medical interventions, treatment or surgery..
4 -> advanced heart-disease or conditions that may be life-threatening.
The following step, handle NA values:
barplot(colMeans(is.na(heart)),
names.arg = names(colMeans(is.na(heart))), col = "skyblue",
ylab = "Percentage of NAs",
main = "Missing Values in Heart Disease",
cex.names = 0.8, las = 2)
# We see that ca and thal variables have more than 60% of missing values, so we delete those
heart = heart %>% select(- ca, - thal)
#summary(heart)
# NAs in rest_blood_preassure; chol; max_heart_rate; and oldpeak seem to be easy to take care of
# However, in slope and restecg they are still more than a 15%
# Let's watch out for observations that have a considerable amount of NAs
amountNAs = rowSums(is.na(heart))
count_by_na = table(amountNAs)
percentage_by_na = prop.table(count_by_na) * 100
for (i in 0:max(amountNAs)) {
cat("Percentage of rows with", i, "missing values:",
percentage_by_na[i], "%\n")
}
## Percentage of rows with 0 missing values: %
## Percentage of rows with 1 missing values: 49.78261 %
## Percentage of rows with 2 missing values: 32.06522 %
## Percentage of rows with 3 missing values: 10.54348 %
## Percentage of rows with 4 missing values: 1.413043 %
## Percentage of rows with 5 missing values: 0.326087 %
## Percentage of rows with 6 missing values: 2.608696 %
## Percentage of rows with 7 missing values: 3.152174 %
percentage_by_na[1] + percentage_by_na[2] + percentage_by_na[3]
## 0
## 92.3913
# We see that more than 92% of the data is covered by observations that
# have less or equal than 3 missing values. Therefore, it might be a
# good idea to delete those rows with more than 4 NAs.
# However, let's see first if there is a pattern in the NA values
md.pattern(heart, plot = TRUE, rotate.names = TRUE)
## age sex chest_pain stage chol max_heart_rate exang rest_blood_pressure
## 458 1 1 1 1 1 1 1 1
## 162 1 1 1 1 1 1 1 1
## 73 1 1 1 1 1 1 1 1
## 48 1 1 1 1 1 1 1 1
## 52 1 1 1 1 1 1 1 1
## 12 1 1 1 1 1 1 1 1
## 17 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1 1 0
## 1 1 1 1 1 1 0 0 0
## 24 1 1 1 1 1 0 0 0
## 27 1 1 1 1 1 0 0 0
## 7 1 1 1 1 0 1 1 1
## 18 1 1 1 1 0 1 1 1
## 1 1 1 1 1 0 1 1 1
## 1 1 1 1 1 0 1 1 1
## 2 1 1 1 1 0 0 0 0
## 1 1 1 1 1 0 0 0 0
## 0 0 0 0 30 55 55 59
## oldpeak blood_sugar restecg slope
## 458 1 1 1 1 0
## 162 1 1 1 0 1
## 73 1 1 0 1 1
## 48 1 1 0 0 2
## 52 1 0 1 1 1
## 12 1 0 1 0 2
## 17 1 0 0 1 2
## 5 1 0 0 0 3
## 4 0 1 0 0 3
## 2 0 0 1 0 3
## 1 0 0 0 0 4
## 1 1 1 1 1 1
## 1 1 1 0 1 2
## 1 1 1 0 0 3
## 1 0 0 1 0 4
## 1 1 1 0 1 4
## 24 0 1 1 0 5
## 27 0 1 0 0 6
## 7 1 1 1 1 1
## 18 1 1 1 0 2
## 1 1 1 0 1 2
## 1 1 1 0 0 3
## 2 0 1 1 0 6
## 1 0 1 0 0 7
## 62 90 181 309 841
# From the plot, we see that the pattern in the NA values is that
# whenever you have more than or exactly 4 missing values
# The NAs will be in the features rest_blood_preassure, max_heart_rate, exang, oldpeak and slope
# So, maybe these variables will affect possible statistical models by making the model noisy
# We will eliminate those rows with more than 4 (included) features
# with missing values
heart = heart[amountNAs <= 4, ]
#summary(heart)
# Despite the fact that we got rid of the majority of the missing values
# without loosing a relevant information, we still have got a
# considerable amount of NAs in the variable restecg. Let's check again
md.pattern(heart, plot = TRUE, rotate.names = TRUE)
## age sex chest_pain stage max_heart_rate exang rest_blood_pressure oldpeak
## 458 1 1 1 1 1 1 1 1
## 162 1 1 1 1 1 1 1 1
## 73 1 1 1 1 1 1 1 1
## 48 1 1 1 1 1 1 1 1
## 52 1 1 1 1 1 1 1 1
## 12 1 1 1 1 1 1 1 1
## 17 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1
## 7 1 1 1 1 1 1 1 1
## 18 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 0
## 2 1 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1 0 1
## 1 1 1 1 1 1 1 0 1
## 1 1 1 1 1 1 1 0 1
## 1 1 1 1 1 1 1 0 0
## 1 1 1 1 1 0 0 0 1
## 0 0 0 0 1 1 5 8
## chol blood_sugar restecg slope
## 458 1 1 1 1 0
## 162 1 1 1 0 1
## 73 1 1 0 1 1
## 48 1 1 0 0 2
## 52 1 0 1 1 1
## 12 1 0 1 0 2
## 17 1 0 0 1 2
## 5 1 0 0 0 3
## 7 0 1 1 1 1
## 18 0 1 1 0 2
## 1 0 1 0 1 2
## 1 0 1 0 0 3
## 4 1 1 0 0 3
## 2 1 0 1 0 3
## 1 1 0 0 0 4
## 1 1 1 1 1 1
## 1 1 1 0 1 2
## 1 1 1 0 0 3
## 1 1 0 1 0 4
## 1 1 1 0 1 4
## 27 90 153 255 540
summary(heart)
## age sex chest_pain rest_blood_pressure chol
## Min. :28.00 0:673 0:470 Min. : 0.0 Min. : 0.0
## 1st Qu.:46.00 1:193 1: 42 1st Qu.:120.0 1st Qu.:174.2
## Median :54.00 2:168 Median :130.0 Median :223.0
## Mean :53.13 3:186 Mean :132.1 Mean :199.0
## 3rd Qu.:60.00 3rd Qu.:140.0 3rd Qu.:268.0
## Max. :77.00 Max. :200.0 Max. :603.0
## blood_sugar restecg max_heart_rate exang oldpeak
## Mode :logical 0:663 Min. : 60.0 Mode :logical Min. :-2.6000
## FALSE:740 1: 0 1st Qu.:120.0 FALSE:529 1st Qu.: 0.0000
## TRUE :126 2:203 Median :140.0 TRUE :337 Median : 0.5000
## Mean :137.6 Mean : 0.8781
## 3rd Qu.:157.0 3rd Qu.: 1.5000
## Max. :202.0 Max. : 6.2000
## slope stage
## 0: 74 0:392
## 1:448 1:252
## 2:344 2:102
## 3: 94
## 4: 26
##
# Even though we chose this, we may consider eliminating those two
# 'problematic' variables
Our data set to work on now has no NA values, however, the values themselves may be incorrect or are not coherent. Hence, treating outliers is going to be crucial for our study. From the summary of the data set, we see that our variables of interest will be: age, rest_blood_pressure, chol, max_heart_rate and oldpeak.
# 3-sigma rule
mu = mean(heart$age)
sigma = sd(heart$age)
sum(heart$age < mu - 3 * sigma | heart$age > mu + 3 * sigma)
## [1] 0
# No 'big' outliers for the age
mu = mean(heart$rest_blood_pressure)
sigma = sd(heart$rest_blood_pressure)
sum(heart$rest_blood_pressure < mu - 3 * sigma |
heart$rest_blood_pressure > mu + 3 * sigma)
## [1] 8
# 8 'big' outliers for the restblood_preassure
mu = mean(heart$chol)
sigma = sd(heart$chol)
sum(heart$chol < mu - 3 * sigma | heart$chol > mu + 3 * sigma)
## [1] 2
# 2 'big' outliers for the chol
mu = mean(heart$max_heart_rate)
sigma = sd(heart$max_heart_rate)
sum(heart$max_heart_rate < mu - 3 * sigma |
heart$max_heart_rate > mu + 3 * sigma)
## [1] 0
# No 'big' outliers for the max_heart_rate
mu = mean(heart$oldpeak)
sigma = sd(heart$oldpeak)
sum(heart$oldpeak < mu - 3 * sigma | heart$oldpeak > mu + 3 * sigma)
## [1] 7
# 7 'big' outliers for the oldpeak
# Next approach (rest_blood_pressure, chol, max_heart_rate and oldpeak)
# Inter Quantile Range
QI = quantile(heart$rest_blood_pressure, 0.25)
QS = quantile(heart$rest_blood_pressure, 0.75)
IQR = QS - QI
sum(heart$rest_blood_pressure < QI - 1.5*IQR |
heart$rest_blood_pressure > QS + 1.5*IQR)
## [1] 28
# 28 values out of the IQR for rest_blood_pressure
QI = quantile(heart$chol, 0.25)
QS = quantile(heart$chol, 0.75)
IQR = QS - QI
sum(heart$chol < QI - 1.5*IQR |
heart$chol > QS + 1.5*IQR)
## [1] 180
# 180 values out of the IQR for chol
QI = quantile(heart$max_heart_rate, 0.25)
QS = quantile(heart$max_heart_rate, 0.75)
IQR = QS - QI
sum(heart$max_heart_rate < QI - 1.5*IQR |
heart$max_heart_rate > QS + 1.5*IQR)
## [1] 2
# 2 values out of the IQR for max_heart_rate
QI = quantile(heart$oldpeak, 0.25)
QS = quantile(heart$oldpeak, 0.75)
IQR = QS - QI
sum(heart$oldpeak < QI - 1.5*IQR |
heart$oldpeak > QS + 1.5*IQR)
## [1] 16
# 16 values out of the IQR for oldpeak
# Let's see that graphically
g1 = ggplot(data = heart) + aes(x = rest_blood_pressure, y = 1) +
geom_boxplot(fill = "lightblue", color = "blue",
outlier.color = "red", outlier.shape = 16) +
xlab("Rest Blood Pressure") + theme_minimal() +
geom_jitter(alpha = 0.7)
# With this first boxplot, we realised that the feature
# rest_blood_pressure had strong outliers below 90 and above 180; with
# a low variability (without those values)
g1
g2 = ggplot(data = heart) + aes(x = chol, y = 1) +
geom_boxplot(fill = "lightblue", color = "blue",
outlier.color = "red", outlier.shape = 16) +
xlab("Cholesterol") + theme_minimal() + geom_jitter(alpha = 0.7)
# Thanks to the second boxplot, we found values that did not make sense
# chol as 0 or above 400; without these values it has a low variability
g2
g3 = ggplot(data = heart) + aes(x = max_heart_rate, y = 1) +
geom_boxplot(fill = "lightblue", color = "blue",
outlier.color = "red", outlier.shape = 16) +
xlab("Max Heart Rate") + theme_minimal() + geom_jitter(alpha = 0.7)
# Due to the geom_jitter() we saw that the values might follow a normal
# distribution, in which outliers did not have a huge impact as long as
# the sample size is sufficiently large. For the feature max_heart_rate
g3
g4 = ggplot(data = heart) + aes(x = oldpeak, y = 1) +
geom_boxplot(fill = "lightblue", color = "blue",
outlier.color = "red", outlier.shape = 16) +
xlab("Old Peak") + theme_minimal() + geom_jitter(alpha = 0.7)
# In this fourth geom_jitter() we saw that the distribution of oldpeak
# was strongly skewed, as a considerable amount of values lied at 0.
# This gave us the reckoning that no matter the outliers, the
# the distribution of the variable is not going to be affected
g4
After seeing which values seem to get out of “normal” bounds, we must study those variables thoroughly.
rest_blood_pressure:
Abnormally low blood pressure < 90 mm Hg (life-threatening)
90 mm Hg <= Normal Blood Pressure<= 120 mm Hg
120 mm Hg < Elevated Blood Pressure <=129 mm Hg
130 mm Hg <= Hypertension Stage 1 <= 139 mm Hg
140 mm Hg <= Hypertension Stage 2 < 180 mm Hg
Hypertensive Crisis >= 180 mm Hg (life-threatening)
We believe that the universities did not took the measurements of patients in a risk of death. If those measurements were to be true, they would be dead by now. So we reckon that those values must be discarded, as they lack of coherence.
chol:
Desirable Total Cholesterol < 200 mg/dl
200 mg/dl <= Borderline High Total Cholesterol <= 239 mg/dl
240 mg/dl <= High Total Cholesterol > 400 mg/dl
From the plot we saw that there are a lot of conglomerated values at 0, which make no sense, as it is impossible that a person has 0 mg/dl of serum cholesterol. This makes us believe that those values as 0, represent some kind of NA or a measurement error. Therefore, handling those values is crucial. Another thing to mention are the extreme values of total cholesterol. Values above 400 mg/dl are a huge health risk and may be errors, however checking the stage on those values to see if they make sense will be a good idea.
So in conclusion for this feature, the problem is the high amount of 0 values and seeing if values above 400 are coherent.
max_heart_rate: the maximum heart rate is quite related to the age. As it follows, when there are no exceptions the next ranges depending on age approximately.
140 to 150 bpm -> Adults 70 and older
150 to 160 bpm -> Adults 60 to 69
160 to 170 bpm -> Adults 50 to 59
170 to 180 bpm -> Adults 40 to 49
180 to 190 bpm -> Adults 30 to 39
190 to 200 bpm -> Adults 20 to 29
There are a small amount of values below 120, we may have to see the distribution of this variable, which seems to be normal based on the small difference between the Median and the Mean. So maybe we should not eliminate outliers here as they are not affecting heavily the distribution of the data. Nevertheless a graphical study of the distribution will verify if this assumption is true or not.
oldpeak: has only 7 outliers from the plot and the summary we see that has a right skewed distribution, heavily produced due to the high amount of values at 0. So eliminating here outliers does not seem a good idea. However, a graphical check of the distribution will be a good way to check our assumption. Taking absolute values we have (note that we can take absolute values as it as measurement in millimeters which means that a negative value shows that the depression was caused in the other sense of the point of reference):
Normal or No ST Depression: oldpeak value of 0 (typically considered no ST depression).
Mild ST Depression: oldpeak values from approximately 0.5 to 1.0 millimeters.
Moderate ST Depression: oldpeak values from approximately 1.1 to 2.0 millimeters.
Severe ST Depression: oldpeak values greater than 2.0 millimeters.
summary(heart) # to check again the values
## age sex chest_pain rest_blood_pressure chol
## Min. :28.00 0:673 0:470 Min. : 0.0 Min. : 0.0
## 1st Qu.:46.00 1:193 1: 42 1st Qu.:120.0 1st Qu.:174.2
## Median :54.00 2:168 Median :130.0 Median :223.0
## Mean :53.13 3:186 Mean :132.1 Mean :199.0
## 3rd Qu.:60.00 3rd Qu.:140.0 3rd Qu.:268.0
## Max. :77.00 Max. :200.0 Max. :603.0
## blood_sugar restecg max_heart_rate exang oldpeak
## Mode :logical 0:663 Min. : 60.0 Mode :logical Min. :-2.6000
## FALSE:740 1: 0 1st Qu.:120.0 FALSE:529 1st Qu.: 0.0000
## TRUE :126 2:203 Median :140.0 TRUE :337 Median : 0.5000
## Mean :137.6 Mean : 0.8781
## 3rd Qu.:157.0 3rd Qu.: 1.5000
## Max. :202.0 Max. : 6.2000
## slope stage
## 0: 74 0:392
## 1:448 1:252
## 2:344 2:102
## 3: 94
## 4: 26
##
# rest_blood_pressure OUTLIERS
# Outliers that make no sense
sum(heart$rest_blood_pressure < 90) # only 2
## [1] 2
sum(heart$rest_blood_pressure >= 180) # 20 values
## [1] 20
# Let's check those values
indeces1 = which((heart$rest_blood_pressure < 90) |
(heart$rest_blood_pressure >= 180))
print(heart[indeces1, ])
## age sex chest_pain rest_blood_pressure chol blood_sugar restecg
## 84 68 0 3 180 274 TRUE 2
## 127 56 1 0 200 288 TRUE 2
## 189 54 0 2 192 283 FALSE 2
## 202 64 1 0 180 325 FALSE 0
## 232 55 1 0 180 327 FALSE 0
## 339 39 0 2 190 241 FALSE 0
## 376 45 1 2 180 166 FALSE 0
## 388 46 0 0 180 280 FALSE 0
## 476 57 1 0 180 347 FALSE 0
## 485 59 0 3 180 213 FALSE 0
## 549 54 0 0 200 198 FALSE 0
## 570 53 0 0 180 285 FALSE 0
## 596 58 1 2 180 393 FALSE 0
## 645 53 0 0 80 0 FALSE 0
## 648 54 0 0 180 0 FALSE 0
## 681 61 0 3 200 0 TRUE 0
## 701 63 0 0 185 0 FALSE 0
## 702 64 1 0 200 0 FALSE 0
## 727 60 0 3 180 0 FALSE 2
## 748 55 0 3 0 0 FALSE 0
## 841 57 0 2 180 285 TRUE 0
## 847 61 0 0 190 287 TRUE 2
## max_heart_rate exang oldpeak slope stage
## 84 150 TRUE 1.6 1 3
## 127 133 TRUE 4.0 0 3
## 189 195 FALSE 0.0 2 1
## 202 154 TRUE 0.0 2 0
## 232 117 TRUE 3.4 1 2
## 339 106 FALSE 0.0 1 0
## 376 180 FALSE 0.0 2 0
## 388 120 FALSE 0.0 1 0
## 476 126 TRUE 0.8 1 0
## 485 100 FALSE 0.0 2 0
## 549 142 TRUE 2.0 1 1
## 570 120 TRUE 1.5 1 1
## 596 110 TRUE 1.0 1 1
## 645 141 TRUE 2.0 0 0
## 648 150 FALSE 1.5 1 1
## 681 70 FALSE 0.0 2 3
## 701 98 TRUE 0.0 2 1
## 702 140 TRUE 1.0 1 3
## 727 140 TRUE 1.5 1 0
## 748 155 FALSE 1.5 1 3
## 841 120 FALSE 0.8 1 1
## 847 150 TRUE 2.0 0 4
# As below 90 there are just 2 observations, we will eliminate those
heart = heart[heart$rest_blood_pressure >= 90, ]
# Now our indeces will be
indeces1 = which(heart$rest_blood_pressure >= 180)
print(heart[indeces1, ])
## age sex chest_pain rest_blood_pressure chol blood_sugar restecg
## 84 68 0 3 180 274 TRUE 2
## 127 56 1 0 200 288 TRUE 2
## 189 54 0 2 192 283 FALSE 2
## 202 64 1 0 180 325 FALSE 0
## 232 55 1 0 180 327 FALSE 0
## 339 39 0 2 190 241 FALSE 0
## 376 45 1 2 180 166 FALSE 0
## 388 46 0 0 180 280 FALSE 0
## 476 57 1 0 180 347 FALSE 0
## 485 59 0 3 180 213 FALSE 0
## 549 54 0 0 200 198 FALSE 0
## 570 53 0 0 180 285 FALSE 0
## 596 58 1 2 180 393 FALSE 0
## 648 54 0 0 180 0 FALSE 0
## 681 61 0 3 200 0 TRUE 0
## 701 63 0 0 185 0 FALSE 0
## 702 64 1 0 200 0 FALSE 0
## 727 60 0 3 180 0 FALSE 2
## 841 57 0 2 180 285 TRUE 0
## 847 61 0 0 190 287 TRUE 2
## max_heart_rate exang oldpeak slope stage
## 84 150 TRUE 1.6 1 3
## 127 133 TRUE 4.0 0 3
## 189 195 FALSE 0.0 2 1
## 202 154 TRUE 0.0 2 0
## 232 117 TRUE 3.4 1 2
## 339 106 FALSE 0.0 1 0
## 376 180 FALSE 0.0 2 0
## 388 120 FALSE 0.0 1 0
## 476 126 TRUE 0.8 1 0
## 485 100 FALSE 0.0 2 0
## 549 142 TRUE 2.0 1 1
## 570 120 TRUE 1.5 1 1
## 596 110 TRUE 1.0 1 1
## 648 150 FALSE 1.5 1 1
## 681 70 FALSE 0.0 2 3
## 701 98 TRUE 0.0 2 1
## 702 140 TRUE 1.0 1 3
## 727 140 TRUE 1.5 1 0
## 841 120 FALSE 0.8 1 1
## 847 150 TRUE 2.0 0 4
# For the rest 20 values that have a rest_blood_pressure above 180
# we will study them later
# chol OUTLIERS
sum(heart$chol == 0) # huge amount (165 values)
## [1] 166
sum(heart$chol > 400) # very few (13 values)
## [1] 14
# Let's check those values
indeces2_1 = which((heart$chol == 0))
indeces2_2 = which((heart$chol > 400))
print(heart[indeces2_1, ])
## age sex chest_pain rest_blood_pressure chol blood_sugar restecg
## 335 39 0 2 120 0 FALSE 2
## 401 48 0 2 100 0 FALSE 0
## 530 38 0 0 110 0 FALSE 0
## 543 52 0 0 170 0 FALSE 0
## 548 54 0 0 140 0 FALSE 0
## 582 66 0 0 140 0 FALSE 0
## 598 32 0 1 95 0 FALSE 0
## 599 34 0 0 115 0 FALSE 2
## 600 35 0 0 112 0 FALSE 0
## 601 36 0 0 110 0 TRUE 0
## 602 38 1 0 105 0 FALSE 0
## 603 38 1 0 110 0 FALSE 0
## 604 38 0 3 100 0 FALSE 0
## 605 38 0 3 115 0 FALSE 0
## 606 38 0 0 135 0 FALSE 0
## 607 38 0 0 150 0 FALSE 0
## 608 40 0 0 95 0 FALSE 0
## 609 41 0 0 125 0 FALSE 0
## 610 42 0 0 105 0 FALSE 0
## 611 42 0 0 145 0 FALSE 0
## 612 43 0 0 100 0 FALSE 0
## 613 43 0 0 115 0 FALSE 0
## 614 43 0 0 140 0 FALSE 0
## 615 45 0 3 110 0 FALSE 0
## 616 46 0 0 100 0 FALSE 0
## 617 46 0 0 115 0 FALSE 0
## 618 47 0 3 110 0 FALSE 0
## 619 47 0 3 155 0 FALSE 0
## 620 47 0 0 110 0 FALSE 0
## 621 47 0 0 160 0 FALSE 0
## 622 48 0 0 115 0 FALSE 0
## 623 50 1 0 160 0 FALSE 0
## 624 50 0 0 115 0 FALSE 0
## 625 50 0 0 120 0 FALSE 0
## 626 50 0 0 145 0 FALSE 0
## 627 51 1 0 120 0 TRUE 0
## 628 51 0 0 110 0 FALSE 0
## 629 51 0 0 120 0 TRUE 0
## 630 51 0 0 130 0 FALSE 0
## 631 51 0 0 130 0 FALSE 0
## 632 51 0 0 140 0 FALSE 0
## 633 51 0 0 95 0 FALSE 0
## 634 52 0 0 130 0 FALSE 0
## 635 52 0 0 135 0 FALSE 0
## 636 52 0 0 165 0 FALSE 0
## 637 52 0 0 95 0 FALSE 0
## 638 53 0 2 120 0 FALSE 0
## 639 53 0 2 130 0 FALSE 0
## 640 53 0 3 105 0 FALSE 0
## 641 53 0 3 160 0 FALSE 2
## 642 53 0 0 120 0 FALSE 0
## 643 53 0 0 125 0 FALSE 0
## 644 53 0 0 130 0 FALSE 2
## 646 54 0 0 120 0 FALSE 0
## 647 54 0 0 130 0 FALSE 0
## 648 54 0 0 180 0 FALSE 0
## 649 55 0 2 140 0 FALSE 0
## 650 55 0 0 115 0 FALSE 0
## 651 55 0 0 120 0 FALSE 0
## 652 55 0 0 140 0 FALSE 0
## 653 56 0 3 120 0 FALSE 0
## 654 56 0 3 125 0 FALSE 0
## 655 56 0 3 155 0 FALSE 0
## 656 56 0 0 115 0 FALSE 0
## 657 56 0 0 120 0 FALSE 2
## 658 56 0 0 120 0 FALSE 0
## 659 56 0 0 125 0 TRUE 0
## 660 56 0 0 140 0 FALSE 0
## 661 57 0 3 105 0 FALSE 0
## 662 57 0 0 110 0 FALSE 0
## 663 57 0 0 140 0 FALSE 0
## 664 57 0 0 140 0 FALSE 0
## 665 57 0 0 160 0 TRUE 0
## 666 57 0 0 95 0 FALSE 0
## 667 58 0 0 115 0 FALSE 0
## 668 58 0 0 130 0 FALSE 0
## 669 58 0 0 170 0 FALSE 0
## 670 59 0 3 125 0 FALSE 0
## 671 59 0 0 110 0 TRUE 0
## 672 59 0 0 120 0 FALSE 0
## 673 59 0 0 125 0 FALSE 0
## 674 59 0 0 135 0 FALSE 0
## 675 60 0 3 115 0 FALSE 0
## 676 60 0 0 125 0 FALSE 0
## 677 60 0 0 130 0 FALSE 0
## 678 60 0 0 135 0 FALSE 0
## 679 60 0 0 160 0 FALSE 0
## 680 60 0 0 160 0 TRUE 0
## 681 61 0 3 200 0 TRUE 0
## 682 61 0 0 105 0 FALSE 0
## 683 61 0 0 110 0 FALSE 0
## 684 61 0 0 125 0 FALSE 0
## 685 61 0 0 130 0 FALSE 2
## 686 61 0 0 130 0 TRUE 0
## 687 61 0 0 150 0 FALSE 0
## 688 61 0 0 150 0 FALSE 0
## 689 61 0 0 160 0 TRUE 0
## 690 62 1 1 140 0 FALSE 0
## 691 62 1 0 120 0 FALSE 0
## 692 62 0 1 120 0 FALSE 2
## 693 62 0 3 160 0 FALSE 0
## 694 62 0 0 115 0 FALSE 0
## 695 62 0 0 115 0 FALSE 0
## 696 62 0 0 150 0 TRUE 0
## 697 63 0 0 100 0 FALSE 0
## 698 63 0 0 140 0 FALSE 2
## 699 63 0 0 150 0 FALSE 0
## 700 63 0 0 150 0 FALSE 0
## 701 63 0 0 185 0 FALSE 0
## 702 64 1 0 200 0 FALSE 0
## 703 64 1 0 95 0 FALSE 0
## 704 64 0 0 110 0 FALSE 0
## 705 65 0 0 115 0 FALSE 0
## 706 65 0 0 145 0 FALSE 0
## 707 65 0 0 155 0 FALSE 0
## 708 65 0 0 160 0 TRUE 0
## 709 66 1 0 155 0 FALSE 0
## 710 66 0 0 150 0 FALSE 0
## 711 67 0 1 145 0 FALSE 2
## 712 68 0 0 135 0 FALSE 0
## 713 68 0 0 145 0 FALSE 0
## 714 69 0 0 135 0 FALSE 0
## 715 70 0 0 115 0 FALSE 0
## 716 70 0 0 140 0 TRUE 0
## 717 72 0 3 160 0 FALSE 2
## 718 73 1 3 160 0 FALSE 0
## 719 74 0 2 145 0 FALSE 0
## 725 66 0 3 120 0 FALSE 0
## 727 60 0 3 180 0 FALSE 2
## 728 60 0 3 120 0 TRUE 0
## 731 59 0 0 140 0 FALSE 0
## 732 62 0 0 110 0 FALSE 0
## 733 57 0 0 128 0 TRUE 0
## 737 63 0 0 126 0 FALSE 2
## 738 60 0 0 152 0 FALSE 0
## 739 58 0 0 116 0 FALSE 0
## 740 64 0 0 120 0 TRUE 0
## 741 63 0 3 130 0 FALSE 0
## 742 52 0 3 128 0 FALSE 2
## 743 69 0 0 130 0 TRUE 0
## 749 52 0 3 122 0 FALSE 0
## 750 64 0 0 144 0 FALSE 0
## 751 60 0 0 120 0 FALSE 0
## 752 59 0 0 154 0 FALSE 0
## 753 61 0 3 120 0 FALSE 0
## 754 40 0 0 125 0 TRUE 0
## 755 61 0 0 125 0 TRUE 0
## 756 41 0 0 104 0 FALSE 0
## 757 63 0 0 136 0 FALSE 0
## 759 51 0 0 128 0 FALSE 0
## 760 59 0 3 125 0 FALSE 0
## 762 55 0 3 120 0 FALSE 0
## 765 53 0 0 126 0 FALSE 0
## 766 68 0 0 138 0 FALSE 0
## 767 53 0 0 154 0 FALSE 0
## 768 59 0 0 178 0 TRUE 2
## 769 61 0 0 110 0 FALSE 0
## 771 56 0 3 170 0 FALSE 2
## 772 58 0 2 126 0 TRUE 0
## 773 69 0 3 140 0 FALSE 0
## 775 58 0 0 120 0 FALSE 2
## 781 49 0 1 130 0 FALSE 0
## 794 67 0 0 120 0 TRUE 0
## 798 43 0 0 122 0 FALSE 0
## 799 63 0 3 130 0 TRUE 0
## 802 48 0 3 102 0 FALSE 0
## max_heart_rate exang oldpeak slope stage
## 335 146 FALSE 2.0 2 0
## 401 100 FALSE 0.0 1 0
## 530 150 TRUE 1.0 1 1
## 543 126 TRUE 1.5 1 1
## 548 118 TRUE 0.0 1 1
## 582 94 TRUE 1.0 1 1
## 598 127 FALSE 0.7 2 1
## 599 154 FALSE 0.2 2 1
## 600 130 TRUE 1.4 0 3
## 601 125 TRUE 1.0 1 1
## 602 166 FALSE 2.8 2 2
## 603 156 FALSE 0.0 1 1
## 604 179 FALSE -1.1 2 0
## 605 128 TRUE 0.0 1 1
## 606 150 FALSE 0.0 2 2
## 607 120 TRUE 2.0 1 1
## 608 144 FALSE 0.0 2 2
## 609 176 FALSE 1.6 2 2
## 610 128 TRUE -1.5 0 1
## 611 99 TRUE 0.0 1 2
## 612 122 FALSE 1.5 0 3
## 613 145 TRUE 2.0 1 4
## 614 140 TRUE 0.5 2 2
## 615 138 FALSE -0.1 2 0
## 616 133 FALSE -2.6 1 1
## 617 113 TRUE 1.5 1 1
## 618 120 TRUE 0.0 1 1
## 619 118 TRUE 1.0 1 3
## 620 149 FALSE 2.1 2 2
## 621 124 TRUE 0.0 1 1
## 622 128 FALSE 0.0 1 2
## 623 110 FALSE 0.0 0 1
## 624 120 TRUE 0.5 1 3
## 625 156 TRUE 0.0 2 3
## 626 139 TRUE 0.7 1 1
## 627 127 TRUE 1.5 2 2
## 628 92 FALSE 0.0 1 4
## 629 104 FALSE 0.0 1 3
## 630 170 FALSE -0.7 2 2
## 631 163 FALSE 0.0 2 1
## 632 60 FALSE 0.0 1 2
## 633 126 FALSE 2.2 1 2
## 634 120 FALSE 0.0 1 2
## 635 128 TRUE 2.0 1 2
## 636 122 TRUE 1.0 2 2
## 637 82 TRUE 1.5 1 2
## 638 95 FALSE 0.0 1 3
## 639 120 FALSE 0.7 0 0
## 640 115 FALSE 0.0 1 1
## 641 122 TRUE 0.0 2 1
## 642 120 FALSE 0.0 1 1
## 643 120 FALSE 1.5 2 4
## 644 135 TRUE 1.0 1 2
## 646 155 FALSE 0.0 1 2
## 647 110 TRUE 3.0 1 3
## 648 150 FALSE 1.5 1 1
## 649 150 FALSE 0.2 2 0
## 650 155 FALSE 0.1 1 1
## 651 92 FALSE 0.3 2 4
## 652 83 FALSE 0.0 1 2
## 653 97 FALSE 0.0 1 0
## 654 98 FALSE -2.0 1 2
## 655 99 FALSE 0.0 1 2
## 656 82 FALSE -1.0 2 1
## 657 100 TRUE -1.0 0 2
## 658 148 FALSE 0.0 1 2
## 659 103 TRUE 1.0 1 3
## 660 121 TRUE 1.8 2 1
## 661 148 FALSE 0.3 1 1
## 662 131 TRUE 1.4 2 3
## 663 120 TRUE 2.0 1 2
## 664 100 TRUE 0.0 1 3
## 665 98 TRUE 2.0 1 2
## 666 182 FALSE 0.7 0 1
## 667 138 FALSE 0.5 2 1
## 668 100 TRUE 1.0 1 4
## 669 105 TRUE 0.0 1 1
## 670 175 FALSE 2.6 1 1
## 671 94 FALSE 0.0 1 3
## 672 115 FALSE 0.0 1 2
## 673 119 TRUE 0.9 2 1
## 674 115 TRUE 1.0 1 1
## 675 143 FALSE 2.4 2 1
## 676 110 FALSE 0.1 2 3
## 677 130 TRUE 1.1 0 1
## 678 63 TRUE 0.5 2 3
## 679 99 TRUE 0.5 1 3
## 680 149 FALSE 0.4 1 1
## 681 70 FALSE 0.0 2 3
## 682 110 TRUE 1.5 2 1
## 683 113 FALSE 1.4 1 1
## 684 105 TRUE 0.0 0 3
## 685 115 FALSE 0.0 1 3
## 686 77 FALSE 2.5 1 3
## 687 105 TRUE 0.0 1 1
## 688 117 TRUE 2.0 1 2
## 689 145 FALSE 1.0 1 2
## 690 143 FALSE 0.0 1 2
## 691 123 TRUE 1.7 0 1
## 692 134 FALSE -0.8 1 1
## 693 72 TRUE 0.0 1 3
## 694 128 TRUE 2.5 0 2
## 695 72 TRUE -0.5 1 1
## 696 78 FALSE 2.0 1 3
## 697 109 FALSE -0.9 1 1
## 698 149 FALSE 2.0 2 2
## 699 86 TRUE 2.0 1 3
## 700 154 FALSE 3.7 2 3
## 701 98 TRUE 0.0 2 1
## 702 140 TRUE 1.0 1 3
## 703 145 FALSE 1.1 0 1
## 704 114 TRUE 1.3 0 1
## 705 93 TRUE 0.0 1 1
## 706 67 FALSE 0.0 1 3
## 707 154 FALSE 1.0 2 0
## 708 122 FALSE 1.5 0 3
## 709 90 FALSE 0.0 1 1
## 710 108 TRUE 2.0 1 3
## 711 125 FALSE 0.0 1 2
## 712 120 TRUE 0.0 2 3
## 713 136 FALSE 1.8 2 1
## 714 130 FALSE 0.0 1 1
## 715 92 TRUE 0.0 1 1
## 716 157 TRUE 2.0 1 3
## 717 114 FALSE 1.6 1 0
## 718 121 FALSE 0.0 2 1
## 719 123 FALSE 1.3 2 1
## 725 120 FALSE -0.5 2 0
## 727 140 TRUE 1.5 1 0
## 728 141 TRUE 2.0 2 3
## 731 117 TRUE 1.0 1 1
## 732 120 TRUE 0.5 1 1
## 733 148 TRUE 1.0 1 1
## 737 120 FALSE 1.5 0 0
## 738 118 TRUE 0.0 1 0
## 739 124 FALSE 1.0 2 2
## 740 106 FALSE 2.0 1 1
## 741 111 TRUE 0.0 1 3
## 742 180 FALSE 3.0 2 2
## 743 129 FALSE 1.0 1 2
## 749 110 TRUE 2.0 0 2
## 750 122 TRUE 1.0 1 3
## 751 133 TRUE 2.0 2 0
## 752 131 TRUE 1.5 0 0
## 753 80 TRUE 0.0 1 3
## 754 165 FALSE 0.0 1 1
## 755 86 FALSE 1.5 1 3
## 756 111 FALSE 0.0 1 0
## 757 84 TRUE 0.0 2 2
## 759 107 FALSE 0.0 1 0
## 760 128 TRUE 2.0 0 2
## 762 125 TRUE 2.5 1 1
## 765 106 FALSE 0.0 2 1
## 766 130 TRUE 3.0 1 2
## 767 140 TRUE 1.5 1 2
## 768 120 TRUE 0.0 1 1
## 769 108 TRUE 2.0 0 2
## 771 123 TRUE 2.5 1 4
## 772 110 TRUE 2.0 1 2
## 773 118 FALSE 2.5 0 2
## 775 106 TRUE 1.5 0 1
## 781 145 FALSE 3.0 1 2
## 794 150 FALSE 1.5 0 3
## 798 120 FALSE 0.5 2 1
## 799 160 FALSE 3.0 1 0
## 802 110 TRUE 1.0 0 1
print(heart[indeces2_2, ])
## age sex chest_pain rest_blood_pressure chol blood_sugar restecg
## 49 65 1 3 140 417 TRUE 2
## 122 63 1 0 150 407 FALSE 2
## 153 67 1 3 115 564 FALSE 2
## 182 56 1 0 134 409 FALSE 2
## 348 40 0 3 140 412 FALSE 0
## 374 44 0 0 150 412 FALSE 0
## 435 53 1 2 113 468 FALSE 0
## 501 40 0 0 120 466 FALSE 0
## 529 32 0 0 118 529 FALSE 0
## 547 54 0 0 130 603 TRUE 0
## 567 52 0 0 140 404 FALSE 0
## 569 53 0 3 145 518 FALSE 0
## 585 44 0 0 135 491 FALSE 0
## 784 58 0 0 132 458 TRUE 0
## max_heart_rate exang oldpeak slope stage
## 49 157 FALSE 0.8 2 0
## 122 154 FALSE 4.0 1 4
## 153 160 FALSE 1.6 1 0
## 182 150 TRUE 1.9 1 2
## 348 188 FALSE 0.0 2 0
## 374 170 FALSE 0.0 2 0
## 435 127 FALSE 0.0 1 0
## 501 152 TRUE 1.0 1 1
## 529 130 FALSE 0.0 2 1
## 547 125 TRUE 1.0 1 1
## 567 124 TRUE 2.0 1 1
## 569 130 FALSE 0.0 1 1
## 585 135 FALSE 0.0 1 1
## 784 69 FALSE 1.0 0 0
# chol over 400 seems to be in just a few cases in which the rest of
# the results are coherent and cohesive. Therefore, instead of deleting
# these obeservations a change of the chol by the mean of the chol
# that have their corresponding age group will be the approach
# groups of age -> NEW FEATURE
# 70 and older; 60 to 69; 50 to 59; 40 to 49; 30 to 39; 20 to 29
heart$ageGroups = numeric(nrow(heart))
for (i in 1:nrow(heart)){
if (20 <= heart$age[i] && heart$age[i] <= 29){
heart$ageGroups[i] = 0
}else if (30 <= heart$age[i] && heart$age[i] <= 39){
heart$ageGroups[i] = 1
}else if (40 <= heart$age[i] && heart$age[i] <= 49){
heart$ageGroups[i] = 2
}else if (50 <= heart$age[i] && heart$age[i] <= 59){
heart$ageGroups[i] = 3
}else if (60 <= heart$age[i] && heart$age[i] <= 69){
heart$ageGroups[i] = 4
}else{
heart$ageGroups[i] = 5
}
}
# Then we factorise this new variable
heart$ageGroups = as.factor(heart$ageGroups)
min(heart$age[which((heart$chol > 400))]) # minimum age
## [1] 32
max(heart$age[which((heart$chol > 400))]) # maximum age
## [1] 67
# So we need to compute the mean of the group of ages 1, 2, 3 and 4
# when chol is != 0 and <= 400
mean1 = mean(heart$chol[heart$chol <= 400 & heart$chol != 0 &
heart$ageGroups == 1])
mean1
## [1] 230.0484
mean1 = ceiling(mean1) # chol is an integer
mean1
## [1] 231
mean2 = mean(heart$chol[heart$chol <= 400 & heart$chol != 0 &
heart$ageGroups == 2])
mean2
## [1] 238.75
mean2 = ceiling(mean2) # chol is an integer
mean2
## [1] 239
mean3 = mean(heart$chol[heart$chol <= 400 & heart$chol != 0 &
heart$ageGroups == 3])
mean3
## [1] 244.5591
mean3 = ceiling(mean3) # chol is an integer
mean3
## [1] 245
mean4 = mean(heart$chol[heart$chol <= 400 & heart$chol != 0 &
heart$ageGroups == 4])
mean4
## [1] 249.3985
mean4 = ceiling(mean4) # chol is an integer
mean4
## [1] 250
# Then we change the value where chol > 400 by its corresponding mean
for (i in 1:length(indeces2_2)){
index_to_use = indeces2_2[i]
if (heart$ageGroups[index_to_use] == 1){
heart$chol[index_to_use] = mean1
}else if (heart$ageGroups[index_to_use] == 2){
heart$chol[index_to_use] = mean2
}else if (heart$ageGroups[index_to_use] == 3){
heart$chol[index_to_use] = mean3
}else{
heart$chol[index_to_use] = mean4
}
}
# Now the problem is in the observations that have chol as 0
# For chol = 0 we will be studying other variables and seek for intersections and relations later
print(heart[indeces2_1, ])
## age sex chest_pain rest_blood_pressure chol blood_sugar restecg
## 335 39 0 2 120 0 FALSE 2
## 401 48 0 2 100 0 FALSE 0
## 530 38 0 0 110 0 FALSE 0
## 543 52 0 0 170 0 FALSE 0
## 548 54 0 0 140 0 FALSE 0
## 582 66 0 0 140 0 FALSE 0
## 598 32 0 1 95 0 FALSE 0
## 599 34 0 0 115 0 FALSE 2
## 600 35 0 0 112 0 FALSE 0
## 601 36 0 0 110 0 TRUE 0
## 602 38 1 0 105 0 FALSE 0
## 603 38 1 0 110 0 FALSE 0
## 604 38 0 3 100 0 FALSE 0
## 605 38 0 3 115 0 FALSE 0
## 606 38 0 0 135 0 FALSE 0
## 607 38 0 0 150 0 FALSE 0
## 608 40 0 0 95 0 FALSE 0
## 609 41 0 0 125 0 FALSE 0
## 610 42 0 0 105 0 FALSE 0
## 611 42 0 0 145 0 FALSE 0
## 612 43 0 0 100 0 FALSE 0
## 613 43 0 0 115 0 FALSE 0
## 614 43 0 0 140 0 FALSE 0
## 615 45 0 3 110 0 FALSE 0
## 616 46 0 0 100 0 FALSE 0
## 617 46 0 0 115 0 FALSE 0
## 618 47 0 3 110 0 FALSE 0
## 619 47 0 3 155 0 FALSE 0
## 620 47 0 0 110 0 FALSE 0
## 621 47 0 0 160 0 FALSE 0
## 622 48 0 0 115 0 FALSE 0
## 623 50 1 0 160 0 FALSE 0
## 624 50 0 0 115 0 FALSE 0
## 625 50 0 0 120 0 FALSE 0
## 626 50 0 0 145 0 FALSE 0
## 627 51 1 0 120 0 TRUE 0
## 628 51 0 0 110 0 FALSE 0
## 629 51 0 0 120 0 TRUE 0
## 630 51 0 0 130 0 FALSE 0
## 631 51 0 0 130 0 FALSE 0
## 632 51 0 0 140 0 FALSE 0
## 633 51 0 0 95 0 FALSE 0
## 634 52 0 0 130 0 FALSE 0
## 635 52 0 0 135 0 FALSE 0
## 636 52 0 0 165 0 FALSE 0
## 637 52 0 0 95 0 FALSE 0
## 638 53 0 2 120 0 FALSE 0
## 639 53 0 2 130 0 FALSE 0
## 640 53 0 3 105 0 FALSE 0
## 641 53 0 3 160 0 FALSE 2
## 642 53 0 0 120 0 FALSE 0
## 643 53 0 0 125 0 FALSE 0
## 644 53 0 0 130 0 FALSE 2
## 646 54 0 0 120 0 FALSE 0
## 647 54 0 0 130 0 FALSE 0
## 648 54 0 0 180 0 FALSE 0
## 649 55 0 2 140 0 FALSE 0
## 650 55 0 0 115 0 FALSE 0
## 651 55 0 0 120 0 FALSE 0
## 652 55 0 0 140 0 FALSE 0
## 653 56 0 3 120 0 FALSE 0
## 654 56 0 3 125 0 FALSE 0
## 655 56 0 3 155 0 FALSE 0
## 656 56 0 0 115 0 FALSE 0
## 657 56 0 0 120 0 FALSE 2
## 658 56 0 0 120 0 FALSE 0
## 659 56 0 0 125 0 TRUE 0
## 660 56 0 0 140 0 FALSE 0
## 661 57 0 3 105 0 FALSE 0
## 662 57 0 0 110 0 FALSE 0
## 663 57 0 0 140 0 FALSE 0
## 664 57 0 0 140 0 FALSE 0
## 665 57 0 0 160 0 TRUE 0
## 666 57 0 0 95 0 FALSE 0
## 667 58 0 0 115 0 FALSE 0
## 668 58 0 0 130 0 FALSE 0
## 669 58 0 0 170 0 FALSE 0
## 670 59 0 3 125 0 FALSE 0
## 671 59 0 0 110 0 TRUE 0
## 672 59 0 0 120 0 FALSE 0
## 673 59 0 0 125 0 FALSE 0
## 674 59 0 0 135 0 FALSE 0
## 675 60 0 3 115 0 FALSE 0
## 676 60 0 0 125 0 FALSE 0
## 677 60 0 0 130 0 FALSE 0
## 678 60 0 0 135 0 FALSE 0
## 679 60 0 0 160 0 FALSE 0
## 680 60 0 0 160 0 TRUE 0
## 681 61 0 3 200 0 TRUE 0
## 682 61 0 0 105 0 FALSE 0
## 683 61 0 0 110 0 FALSE 0
## 684 61 0 0 125 0 FALSE 0
## 685 61 0 0 130 0 FALSE 2
## 686 61 0 0 130 0 TRUE 0
## 687 61 0 0 150 0 FALSE 0
## 688 61 0 0 150 0 FALSE 0
## 689 61 0 0 160 0 TRUE 0
## 690 62 1 1 140 0 FALSE 0
## 691 62 1 0 120 0 FALSE 0
## 692 62 0 1 120 0 FALSE 2
## 693 62 0 3 160 0 FALSE 0
## 694 62 0 0 115 0 FALSE 0
## 695 62 0 0 115 0 FALSE 0
## 696 62 0 0 150 0 TRUE 0
## 697 63 0 0 100 0 FALSE 0
## 698 63 0 0 140 0 FALSE 2
## 699 63 0 0 150 0 FALSE 0
## 700 63 0 0 150 0 FALSE 0
## 701 63 0 0 185 0 FALSE 0
## 702 64 1 0 200 0 FALSE 0
## 703 64 1 0 95 0 FALSE 0
## 704 64 0 0 110 0 FALSE 0
## 705 65 0 0 115 0 FALSE 0
## 706 65 0 0 145 0 FALSE 0
## 707 65 0 0 155 0 FALSE 0
## 708 65 0 0 160 0 TRUE 0
## 709 66 1 0 155 0 FALSE 0
## 710 66 0 0 150 0 FALSE 0
## 711 67 0 1 145 0 FALSE 2
## 712 68 0 0 135 0 FALSE 0
## 713 68 0 0 145 0 FALSE 0
## 714 69 0 0 135 0 FALSE 0
## 715 70 0 0 115 0 FALSE 0
## 716 70 0 0 140 0 TRUE 0
## 717 72 0 3 160 0 FALSE 2
## 718 73 1 3 160 0 FALSE 0
## 719 74 0 2 145 0 FALSE 0
## 725 66 0 3 120 0 FALSE 0
## 727 60 0 3 180 0 FALSE 2
## 728 60 0 3 120 0 TRUE 0
## 731 59 0 0 140 0 FALSE 0
## 732 62 0 0 110 0 FALSE 0
## 733 57 0 0 128 0 TRUE 0
## 737 63 0 0 126 0 FALSE 2
## 738 60 0 0 152 0 FALSE 0
## 739 58 0 0 116 0 FALSE 0
## 740 64 0 0 120 0 TRUE 0
## 741 63 0 3 130 0 FALSE 0
## 742 52 0 3 128 0 FALSE 2
## 743 69 0 0 130 0 TRUE 0
## 749 52 0 3 122 0 FALSE 0
## 750 64 0 0 144 0 FALSE 0
## 751 60 0 0 120 0 FALSE 0
## 752 59 0 0 154 0 FALSE 0
## 753 61 0 3 120 0 FALSE 0
## 754 40 0 0 125 0 TRUE 0
## 755 61 0 0 125 0 TRUE 0
## 756 41 0 0 104 0 FALSE 0
## 757 63 0 0 136 0 FALSE 0
## 759 51 0 0 128 0 FALSE 0
## 760 59 0 3 125 0 FALSE 0
## 762 55 0 3 120 0 FALSE 0
## 765 53 0 0 126 0 FALSE 0
## 766 68 0 0 138 0 FALSE 0
## 767 53 0 0 154 0 FALSE 0
## 768 59 0 0 178 0 TRUE 2
## 769 61 0 0 110 0 FALSE 0
## 771 56 0 3 170 0 FALSE 2
## 772 58 0 2 126 0 TRUE 0
## 773 69 0 3 140 0 FALSE 0
## 775 58 0 0 120 0 FALSE 2
## 781 49 0 1 130 0 FALSE 0
## 794 67 0 0 120 0 TRUE 0
## 798 43 0 0 122 0 FALSE 0
## 799 63 0 3 130 0 TRUE 0
## 802 48 0 3 102 0 FALSE 0
## max_heart_rate exang oldpeak slope stage ageGroups
## 335 146 FALSE 2.0 2 0 1
## 401 100 FALSE 0.0 1 0 2
## 530 150 TRUE 1.0 1 1 1
## 543 126 TRUE 1.5 1 1 3
## 548 118 TRUE 0.0 1 1 3
## 582 94 TRUE 1.0 1 1 4
## 598 127 FALSE 0.7 2 1 1
## 599 154 FALSE 0.2 2 1 1
## 600 130 TRUE 1.4 0 3 1
## 601 125 TRUE 1.0 1 1 1
## 602 166 FALSE 2.8 2 2 1
## 603 156 FALSE 0.0 1 1 1
## 604 179 FALSE -1.1 2 0 1
## 605 128 TRUE 0.0 1 1 1
## 606 150 FALSE 0.0 2 2 1
## 607 120 TRUE 2.0 1 1 1
## 608 144 FALSE 0.0 2 2 2
## 609 176 FALSE 1.6 2 2 2
## 610 128 TRUE -1.5 0 1 2
## 611 99 TRUE 0.0 1 2 2
## 612 122 FALSE 1.5 0 3 2
## 613 145 TRUE 2.0 1 4 2
## 614 140 TRUE 0.5 2 2 2
## 615 138 FALSE -0.1 2 0 2
## 616 133 FALSE -2.6 1 1 2
## 617 113 TRUE 1.5 1 1 2
## 618 120 TRUE 0.0 1 1 2
## 619 118 TRUE 1.0 1 3 2
## 620 149 FALSE 2.1 2 2 2
## 621 124 TRUE 0.0 1 1 2
## 622 128 FALSE 0.0 1 2 2
## 623 110 FALSE 0.0 0 1 3
## 624 120 TRUE 0.5 1 3 3
## 625 156 TRUE 0.0 2 3 3
## 626 139 TRUE 0.7 1 1 3
## 627 127 TRUE 1.5 2 2 3
## 628 92 FALSE 0.0 1 4 3
## 629 104 FALSE 0.0 1 3 3
## 630 170 FALSE -0.7 2 2 3
## 631 163 FALSE 0.0 2 1 3
## 632 60 FALSE 0.0 1 2 3
## 633 126 FALSE 2.2 1 2 3
## 634 120 FALSE 0.0 1 2 3
## 635 128 TRUE 2.0 1 2 3
## 636 122 TRUE 1.0 2 2 3
## 637 82 TRUE 1.5 1 2 3
## 638 95 FALSE 0.0 1 3 3
## 639 120 FALSE 0.7 0 0 3
## 640 115 FALSE 0.0 1 1 3
## 641 122 TRUE 0.0 2 1 3
## 642 120 FALSE 0.0 1 1 3
## 643 120 FALSE 1.5 2 4 3
## 644 135 TRUE 1.0 1 2 3
## 646 155 FALSE 0.0 1 2 3
## 647 110 TRUE 3.0 1 3 3
## 648 150 FALSE 1.5 1 1 3
## 649 150 FALSE 0.2 2 0 3
## 650 155 FALSE 0.1 1 1 3
## 651 92 FALSE 0.3 2 4 3
## 652 83 FALSE 0.0 1 2 3
## 653 97 FALSE 0.0 1 0 3
## 654 98 FALSE -2.0 1 2 3
## 655 99 FALSE 0.0 1 2 3
## 656 82 FALSE -1.0 2 1 3
## 657 100 TRUE -1.0 0 2 3
## 658 148 FALSE 0.0 1 2 3
## 659 103 TRUE 1.0 1 3 3
## 660 121 TRUE 1.8 2 1 3
## 661 148 FALSE 0.3 1 1 3
## 662 131 TRUE 1.4 2 3 3
## 663 120 TRUE 2.0 1 2 3
## 664 100 TRUE 0.0 1 3 3
## 665 98 TRUE 2.0 1 2 3
## 666 182 FALSE 0.7 0 1 3
## 667 138 FALSE 0.5 2 1 3
## 668 100 TRUE 1.0 1 4 3
## 669 105 TRUE 0.0 1 1 3
## 670 175 FALSE 2.6 1 1 3
## 671 94 FALSE 0.0 1 3 3
## 672 115 FALSE 0.0 1 2 3
## 673 119 TRUE 0.9 2 1 3
## 674 115 TRUE 1.0 1 1 3
## 675 143 FALSE 2.4 2 1 4
## 676 110 FALSE 0.1 2 3 4
## 677 130 TRUE 1.1 0 1 4
## 678 63 TRUE 0.5 2 3 4
## 679 99 TRUE 0.5 1 3 4
## 680 149 FALSE 0.4 1 1 4
## 681 70 FALSE 0.0 2 3 4
## 682 110 TRUE 1.5 2 1 4
## 683 113 FALSE 1.4 1 1 4
## 684 105 TRUE 0.0 0 3 4
## 685 115 FALSE 0.0 1 3 4
## 686 77 FALSE 2.5 1 3 4
## 687 105 TRUE 0.0 1 1 4
## 688 117 TRUE 2.0 1 2 4
## 689 145 FALSE 1.0 1 2 4
## 690 143 FALSE 0.0 1 2 4
## 691 123 TRUE 1.7 0 1 4
## 692 134 FALSE -0.8 1 1 4
## 693 72 TRUE 0.0 1 3 4
## 694 128 TRUE 2.5 0 2 4
## 695 72 TRUE -0.5 1 1 4
## 696 78 FALSE 2.0 1 3 4
## 697 109 FALSE -0.9 1 1 4
## 698 149 FALSE 2.0 2 2 4
## 699 86 TRUE 2.0 1 3 4
## 700 154 FALSE 3.7 2 3 4
## 701 98 TRUE 0.0 2 1 4
## 702 140 TRUE 1.0 1 3 4
## 703 145 FALSE 1.1 0 1 4
## 704 114 TRUE 1.3 0 1 4
## 705 93 TRUE 0.0 1 1 4
## 706 67 FALSE 0.0 1 3 4
## 707 154 FALSE 1.0 2 0 4
## 708 122 FALSE 1.5 0 3 4
## 709 90 FALSE 0.0 1 1 4
## 710 108 TRUE 2.0 1 3 4
## 711 125 FALSE 0.0 1 2 4
## 712 120 TRUE 0.0 2 3 4
## 713 136 FALSE 1.8 2 1 4
## 714 130 FALSE 0.0 1 1 4
## 715 92 TRUE 0.0 1 1 5
## 716 157 TRUE 2.0 1 3 5
## 717 114 FALSE 1.6 1 0 5
## 718 121 FALSE 0.0 2 1 5
## 719 123 FALSE 1.3 2 1 5
## 725 120 FALSE -0.5 2 0 4
## 727 140 TRUE 1.5 1 0 4
## 728 141 TRUE 2.0 2 3 4
## 731 117 TRUE 1.0 1 1 3
## 732 120 TRUE 0.5 1 1 4
## 733 148 TRUE 1.0 1 1 3
## 737 120 FALSE 1.5 0 0 4
## 738 118 TRUE 0.0 1 0 4
## 739 124 FALSE 1.0 2 2 3
## 740 106 FALSE 2.0 1 1 4
## 741 111 TRUE 0.0 1 3 4
## 742 180 FALSE 3.0 2 2 3
## 743 129 FALSE 1.0 1 2 4
## 749 110 TRUE 2.0 0 2 3
## 750 122 TRUE 1.0 1 3 4
## 751 133 TRUE 2.0 2 0 4
## 752 131 TRUE 1.5 0 0 3
## 753 80 TRUE 0.0 1 3 4
## 754 165 FALSE 0.0 1 1 2
## 755 86 FALSE 1.5 1 3 4
## 756 111 FALSE 0.0 1 0 2
## 757 84 TRUE 0.0 2 2 4
## 759 107 FALSE 0.0 1 0 3
## 760 128 TRUE 2.0 0 2 3
## 762 125 TRUE 2.5 1 1 3
## 765 106 FALSE 0.0 2 1 3
## 766 130 TRUE 3.0 1 2 4
## 767 140 TRUE 1.5 1 2 3
## 768 120 TRUE 0.0 1 1 3
## 769 108 TRUE 2.0 0 2 4
## 771 123 TRUE 2.5 1 4 3
## 772 110 TRUE 2.0 1 2 3
## 773 118 FALSE 2.5 0 2 4
## 775 106 TRUE 1.5 0 1 3
## 781 145 FALSE 3.0 1 2 2
## 794 150 FALSE 1.5 0 3 4
## 798 120 FALSE 0.5 2 1 2
## 799 160 FALSE 3.0 1 0 4
## 802 110 TRUE 1.0 0 1 2
# Let's study the percentage of observations that have 0 as chol
count_chol_0 = sum(heart$chol == 0)
total_observations = nrow(heart)
percentage_chol_0 = (count_chol_0 / total_observations) * 100
cat("Percentage of rows with chol equal to 0:",
percentage_chol_0, "%\n")
## Percentage of rows with chol equal to 0: 19.21296 %
# For the max_heart_rate we will see if our assumption was true
g5 = ggplot(data = heart) + aes(x = max_heart_rate, color = "orange") +
geom_density(alpha = 0.5, fill = "orange") +
labs(title = "Distribution of the Max Heart Rate",
xlab = "Max Heart Rate")
g5
# We see that it tends to be a normal distribution skewed, so because of
# this, the IQR and the 3-sigma rule, we will not delete any values
# We will procceed the same for oldpeak
g6 = ggplot(data = heart) + aes(x = oldpeak, color = "navy") +
geom_density(alpha = 0.5, fill = "navy")+
labs(title = "Distribution of the Oldpeak",
xlab = "Oldpeak")
g6
# We see that our assumption was right, we do not need to eliminate
# oldpeak outliers
summary(heart)
## age sex chest_pain rest_blood_pressure chol
## Min. :28.00 0:671 0:469 Min. : 92.0 Min. : 0.0
## 1st Qu.:46.00 1:193 1: 42 1st Qu.:120.0 1st Qu.:175.0
## Median :54.00 2:168 Median :130.0 Median :223.0
## Mean :53.13 3:185 Mean :132.3 Mean :195.8
## 3rd Qu.:60.00 3rd Qu.:140.0 3rd Qu.:265.0
## Max. :77.00 Max. :200.0 Max. :394.0
## blood_sugar restecg max_heart_rate exang oldpeak
## Mode :logical 0:661 Min. : 60.0 Mode :logical Min. :-2.600
## FALSE:738 1: 0 1st Qu.:120.0 FALSE:528 1st Qu.: 0.000
## TRUE :126 2:203 Median :140.0 TRUE :336 Median : 0.500
## Mean :137.6 Mean : 0.876
## 3rd Qu.:157.2 3rd Qu.: 1.500
## Max. :202.0 Max. : 6.200
## slope stage ageGroups
## 0: 73 0:391 0: 4
## 1:447 1:252 1: 75
## 2:344 2:102 2:209
## 3: 93 3:353
## 4: 26 4:196
## 5: 27
Checkpoint -> so far our handling of the outliers has fulfilled the following things stated before:
rest_blood_pressure -> values below 90 (still not values above 180 which are 20).
chol -> values above 400 have been substituted by the mean of its corresponding age group. However the values that are 0 (more than 160 values) have still not been taken care of.
max_heart_rate and oldpeak -> our assumption that the possible few outliers did not affect the distribution of both features was right. Therefore they were not deleted.
To sum up, the last outliers to analyse are the rest_blood_pressure values above 180 and the chol values that are 0.
# First we will look for the values that are both 0 in chol and above
# 180 in rest_blood_pressure
common_indeces = which(heart$chol == 0 &
heart$rest_blood_pressure >= 180)
common_indeces
## [1] 647 680 700 701 726
# We delete those observations
heart = heart[- common_indeces, ]
summary(heart)
## age sex chest_pain rest_blood_pressure chol
## Min. :28.00 0:667 0:466 Min. : 92 Min. : 0.0
## 1st Qu.:46.00 1:192 1: 42 1st Qu.:120 1st Qu.:176.5
## Median :54.00 2:168 Median :130 Median :224.0
## Mean :53.08 3:183 Mean :132 Mean :197.0
## 3rd Qu.:60.00 3rd Qu.:140 3rd Qu.:265.0
## Max. :77.00 Max. :200 Max. :394.0
## blood_sugar restecg max_heart_rate exang oldpeak
## Mode :logical 0:657 Min. : 60.0 Mode :logical Min. :-2.6000
## FALSE:734 1: 0 1st Qu.:120.0 FALSE:526 1st Qu.: 0.0000
## TRUE :125 2:202 Median :140.0 TRUE :333 Median : 0.5000
## Mean :137.7 Mean : 0.8765
## 3rd Qu.:158.0 3rd Qu.: 1.5000
## Max. :202.0 Max. : 6.2000
## slope stage ageGroups
## 0: 73 0:390 0: 4
## 1:444 1:250 1: 75
## 2:342 2:102 2:209
## 3: 91 3:352
## 4: 26 4:192
## 5: 27
# Let's check how many outliers for chol and rest_blood_pressure
# are left
sum(heart$chol == 0) # 159 left
## [1] 161
sum(heart$rest_blood_pressure >= 180) # 15 left
## [1] 15
# We will study to which range of ages belong this variables
max(heart$age[which(heart$chol == 0)])
## [1] 74
min(heart$age[which(heart$chol == 0)])
## [1] 32
max(heart$age[which(heart$rest_blood_pressure >= 180)])
## [1] 68
min(heart$age[which(heart$rest_blood_pressure >= 180)])
## [1] 39
# We see both variables cover almost the same ranges
heart[which(heart$chol == 0), ]
heart[which(heart$rest_blood_pressure >= 180), ]
# We see that the values at the "frontier" are the wide majority
# So the value strictly greater are
sum(heart$rest_blood_pressure > 180)
## [1] 5
# We will delete those observations and keep the rest 180 values at
# the "frontier"
indeces_delete = which(heart$rest_blood_pressure > 180)
heart = heart[- indeces_delete, ]
summary(heart)
## age sex chest_pain rest_blood_pressure chol
## Min. :28.00 0:663 0:463 Min. : 92.0 Min. : 0.0
## 1st Qu.:46.00 1:191 1: 42 1st Qu.:120.0 1st Qu.:175.2
## Median :54.00 2:166 Median :130.0 Median :223.0
## Mean :53.09 3:183 Mean :131.6 Mean :196.6
## 3rd Qu.:60.00 3rd Qu.:140.0 3rd Qu.:265.0
## Max. :77.00 Max. :180.0 Max. :394.0
## blood_sugar restecg max_heart_rate exang oldpeak
## Mode :logical 0:655 Min. : 60.0 Mode :logical Min. :-2.6000
## FALSE:731 1: 0 1st Qu.:120.0 FALSE:524 1st Qu.: 0.0000
## TRUE :123 2:199 Median :140.0 TRUE :330 Median : 0.5000
## Mean :137.6 Mean : 0.8722
## 3rd Qu.:158.0 3rd Qu.: 1.5000
## Max. :202.0 Max. : 6.2000
## slope stage ageGroups
## 0: 71 0:389 0: 4
## 1:442 1:248 1: 74
## 2:341 2:102 2:209
## 3: 90 3:349
## 4: 25 4:191
## 5: 27
# In the end, we will procceed with the remaining 0s in chol by the
# mean of group of age; remember the ages covered from 32 to 76
last_indeces = which(heart$chol == 0)
for (i in 1:length(last_indeces)){
to_use = last_indeces[i]
if (heart$ageGroups[to_use] == 1){
heart$chol[to_use] = mean1
}else if (heart$ageGroups[to_use] == 2){
heart$chol[to_use] = mean2
}else if (heart$ageGroups[to_use] == 3){
heart$chol[to_use] = mean3
}else{
heart$chol[to_use] = mean4
}
}
summary(heart)
## age sex chest_pain rest_blood_pressure chol
## Min. :28.00 0:663 0:463 Min. : 92.0 Min. : 85.0
## 1st Qu.:46.00 1:191 1: 42 1st Qu.:120.0 1st Qu.:216.0
## Median :54.00 2:166 Median :130.0 Median :245.0
## Mean :53.09 3:183 Mean :131.6 Mean :242.8
## 3rd Qu.:60.00 3rd Qu.:140.0 3rd Qu.:265.0
## Max. :77.00 Max. :180.0 Max. :394.0
## blood_sugar restecg max_heart_rate exang oldpeak
## Mode :logical 0:655 Min. : 60.0 Mode :logical Min. :-2.6000
## FALSE:731 1: 0 1st Qu.:120.0 FALSE:524 1st Qu.: 0.0000
## TRUE :123 2:199 Median :140.0 TRUE :330 Median : 0.5000
## Mean :137.6 Mean : 0.8722
## 3rd Qu.:158.0 3rd Qu.: 1.5000
## Max. :202.0 Max. : 6.2000
## slope stage ageGroups
## 0: 71 0:389 0: 4
## 1:442 1:248 1: 74
## 2:341 2:102 2:209
## 3: 90 3:349
## 4: 25 4:191
## 5: 27
# New boxplot where we have "new" outliers that are not errors and make
# sense
g7 = ggplot(data = heart) + aes(x = heart$chol) +
geom_boxplot(fill = "lightblue", color = "blue",
outlier.color = "red", outlier.shape = 16) +
geom_jitter(aes(y = 0), alpha = 0.5) + theme_minimal() +
xlab("Cholesterol")
# This boxplot of the variable chol have a few outliers which actually make sense
g7
We have finally have our data ready to go. However, normalising and standarising it could be a great idea to compare the variability and the results between features.
We need to normalize the numerical variables in order to have the same scale, without losing any information.
# We save the data set before normalizing for future plots
heart_not_norm <- heart
heart$age <- (heart$age - min(heart$age))/(max(heart$age) - min(heart$age))
heart$rest_blood_pressure <- (heart$rest_blood_pressure - min(heart$rest_blood_pressure))/(max(heart$rest_blood_pressure) - min(heart$rest_blood_pressure))
heart$chol <- (heart$chol - min(heart$chol))/(max(heart$chol) - min(heart$chol))
heart$max_heart_rate <- (heart$max_heart_rate - min(heart$max_heart_rate))/(max
(heart$max_heart_rate) - min(heart$max_heart_rate))
heart$oldpeak <- (heart$oldpeak - min(heart$oldpeak))/(max(heart$oldpeak) - min(heart$oldpeak))
summary(heart)
## age sex chest_pain rest_blood_pressure chol
## Min. :0.0000 0:663 0:463 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.3673 1:191 1: 42 1st Qu.:0.3182 1st Qu.:0.4239
## Median :0.5306 2:166 Median :0.4318 Median :0.5178
## Mean :0.5120 3:183 Mean :0.4499 Mean :0.5106
## 3rd Qu.:0.6531 3rd Qu.:0.5455 3rd Qu.:0.5825
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## blood_sugar restecg max_heart_rate exang oldpeak
## Mode :logical 0:655 Min. :0.0000 Mode :logical Min. :0.0000
## FALSE:731 1: 0 1st Qu.:0.4225 FALSE:524 1st Qu.:0.2955
## TRUE :123 2:199 Median :0.5634 TRUE :330 Median :0.3523
## Mean :0.5467 Mean :0.3946
## 3rd Qu.:0.6901 3rd Qu.:0.4659
## Max. :1.0000 Max. :1.0000
## slope stage ageGroups
## 0: 71 0:389 0: 4
## 1:442 1:248 1: 74
## 2:341 2:102 2:209
## 3: 90 3:349
## 4: 25 4:191
## 5: 27
In addition to the plots we have already created during the feature engineering process, we are going to create some additional ones to gain a deeper understanding of the dataset.
As a first approach for visualizing the data we are going to create a variable that divides the dataset in 2 group (boolean), where true will be that they have any stage of disease (this may be useful for visualization the future results).
heart <- heart %>%
mutate(heart_disease = heart$stage != 0)
heart$heart_disease = as_factor(heart$heart_disease)
heart_not_norm$heart_disease = as_factor(heart$heart_disease)
sex_plot = factor(heart$sex, levels = c(0, 1),
labels = c("Male", "Female"))
chest_pain_plot = factor(heart$chest_pain,
labels = c("asymptomatic",
"typical angina",
"atypical angina",
"non-anginal"),
levels = c(0, 1, 2, 3))
# Using this new variable will make much easier the visualization of the data
heart_not_norm %>% ggplot(aes(x=heart_disease, y = max_heart_rate)) + geom_boxplot(fill="lightblue") +
xlab("Heart Disease?") +
ylab("Max Heart Rate") +
labs(title = "Heart Disease vs Max Heart Rate ")
People with a low max heart rate tend to have a heart disease.
ggplot(data = heart_not_norm, aes(x = age, y = max_heart_rate, color = sex_plot)) +
geom_point()+
geom_smooth()+
xlab("Age") +
ylab("Maximum Heart Rate") +
labs(title = "Relationship Between Max Heart Rate and Age",
subtitle = "Divided by gender",
color = "Gender")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
We can see that there is a very soft correlation between these variables, at higher ages the max heart rates become lower. We can also appreciate that
ggplot(data = heart, aes(x = chest_pain, fill = chest_pain_plot)) +
geom_bar() +
xlab("Chest Pain") +
ylab("Count") +
labs(title = "Number of Observations by Type of Chest Pain",
fill = "Chest Pain")
The majority of our observations have asymptomatic chest pain.
ggplot(data = heart, aes(x = chest_pain, y = heart_not_norm$max_heart_rate,
color = chest_pain_plot)) +
geom_boxplot() +
geom_jitter(size = 1, alpha = 0.2) +
facet_grid(~ slope, labeller = labeller(slope = c("0" = "Downsloping",
"1" = "Flat",
"2" = "Upsloping"))) +
xlab("Chest Pain") +
ylab("Max Heart Rate") +
labs(title = "Impact of Blood Pressure in Appearance of Chest Pain",
subtitle = "Divided by slope groups",
color = "Chest Pain")
We can see that the few people that has a “downsloping” slope belong to the asymptomatic chest pain group. Besides, we can see that the “upsloping” group has, in general, a higher heart rate than the other slope groups. Finally, the people with asymptomatic chest pain (group 0) tend to present a lower maximum heart rate with respect to the others.
ggplot(data = heart_not_norm, aes(x = chol)) +
geom_histogram(binwidth = 30, fill = "blue", color = "black")
The cholesterol is distributed normally.
ggplot(data = heart_not_norm, aes(x = sex, y = age, fill = sex_plot)) +
geom_violin()+
geom_boxplot(alpha = 0.2) +
geom_jitter(alpha = 0.3) +
facet_grid(.~ exang) +
xlab("Gender") +
ylab("Age") +
labs(title = "Age and Gender depending of the Exang",
subtitle = "Exang - Exercise induced angina (True / False)",
fill = "Gender")
Older people tend to have more an exercise induced angina, no matter their gender.
ggplot(data = heart, aes(x = rest_blood_pressure, fill = sex_plot)) +
geom_density(alpha = 0.5) +
xlab("Rest Blood Pressure") +
labs(fill = "Gender",
title = "Density of Blood Pressure at Rest by Gender")
The blood pressure at rest of a person does not depend on its gender.
ggplot(data = heart, aes(x = blood_sugar, fill = sex_plot)) +
geom_bar(position = "fill") +
xlab("Blood Sugar") +
ylab("Percentage") +
labs(title = "Gender Percentages by Blood Sugar > 120 mg/dl",
fill = "Gender")
The values are pretty similar for both groups of blood sugar. However, the majority of women are <120 mg/dl while the majority of men have a blood sugar level of >120 mg/dl.
ggplot(data = heart_not_norm, aes(x = rest_blood_pressure,
y = max_heart_rate,
color = heart_disease)) +
geom_point() +
facet_wrap(~ sex, labeller = labeller(sex = c("0" = "Male",
"1" = "Female"))) +
xlab("Rest Blood Pressure") +
ylab("Max Heart Rate") +
labs(title = "Rest Blood Pressure vs Max Heart Rate divided by Gender",
color = "Heart Disease")
We can see a higher range of heart rates for males than for females. Besides, there is no apparent correlation between heart rate and blood pressure.
ggplot(data = heart, aes(x = chest_pain_plot, fill = exang)) +
geom_bar(position = "dodge") +
facet_grid(sex ~ ., labeller = labeller(sex = c("0" = "Male",
"1" = "Female"))) +
xlab("Chest Pain Type") +
ylab("Count") +
labs(title = "# of People with each Chest Pain Type",
subtitle = "Divided by Gender and Exercise-induced angina")
There were less meassurements for gender female than for gender male. They seem to have the same tendency overall with more or less the same proportions. The differences of proportions are not noticeable, however, very few females had exercise-induced angina.
ggplot(data = heart_not_norm, aes(x = exang, y = max_heart_rate, fill = slope)) +
geom_boxplot() +
xlab("Exercise-Induced Angina")+
ylab("Max Heart Rate") +
labs(title= "Heart Rates of the different slope groups divided by exang",
subtitle = "exang - exercise-induced angina")
The heart rates of all kinds of slope values are higher for the people without an exercise-induced angina. Besides, the group with level 2 slope has the highest max_heart_rate value no matter the exang.
library(corrplot)
## corrplot 0.92 loaded
correlation_matrix <- cor(heart[, sapply(heart, is.numeric)])
corrplot(correlation_matrix, method = "color", tl.cex = 0.55)
We can appreciate that the max_heart_rate is negatively related to all the other variables. We can also highlight that the strongest relationship is between age and max_heart_rate.
library(GGally)
ggpairs(heart, columns = c("age", "max_heart_rate", "chol", "rest_blood_pressure"),
aes(color = sex),
lower = list(geom_smooth(method = "lm"),
mapping = aes(fill = sex)))
No important relationships obtained.
We are going to start by making a brief descriptive analysis:
Dimension 1: univariate analysis.
heart_pca = heart[,c("age", "rest_blood_pressure", "chol", "max_heart_rate",
"oldpeak")]
boxplot(heart_pca, las=2, col="lightblue")
Dimension 2: bivariate analysis.
ggcorr(heart_pca, label = TRUE)
cor_mat = cor(heart_pca)
heatmap(cor_mat)
We can see that there are no big relationships between the features since all the values obtained by the correlation matrix are very low, which will make the analysis more difficult.
We can start with the Principal Components.
pca = prcomp(heart_pca, scale=T)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1.3103 0.9884 0.9515 0.9051 0.7626
## Proportion of Variance 0.3434 0.1954 0.1811 0.1638 0.1163
## Cumulative Proportion 0.3434 0.5388 0.7198 0.8837 1.0000
We can see that with the 2 first PCs we can only explain a 54% of the variability of the data.
# eigen(cor(heart_numeric))
# screeplot(pca,main="Screeplot",col="blue",type="barplot",pch=19)
fviz_screeplot(pca, addlabels = TRUE)
# First Component
barplot(pca$rotation[,1], las=2, col="#E26D5C")
# Let's try with squared loadings instead
fviz_contrib(pca, choice = "var", axes = 1)
This first component shows us that the most significant fact in the appearance of any sort of heart disease is the age. This is followed by the maximum heart rate, the oldpeak and the blood pressure, which are all around 20%. Finally, the cholesterol does not seem to have a noticeable impact.
Moreover, let us search any further kind of relationship obtained by the PC1 ranking of observations.
# The worst
heart$stage[order(pca$x[,1])][(length(heart)-5):length(heart)]
## [1] 0 3 3 4 3 2
## Levels: 0 1 2 3 4
# The best
heart$stage[order(pca$x[,1])][1:10]
## [1] 0 3 3 2 2 1 4 1 0 3
## Levels: 0 1 2 3 4
We cannot obtain any useful information just with 1 component related to the disease.
head(get_pca_ind(pca)$contrib[,1]) # this is in %, that is between 0 and 100
## 1 2 3 4 5 6
## 0.10464184 0.45709259 0.12929655 0.05240512 0.13978004 0.05574121
head((pca$x[,1]^2)/(pca$sdev[1]^2))/dim(heart_pca)[1] # this is between 0 and 1
## 1 2 3 4 5 6
## 0.0010464184 0.0045709259 0.0012929655 0.0005240512 0.0013978004 0.0005574121
fviz_contrib(pca, choice = "ind", axes = 1, top=100)
# Second Component
barplot(pca$rotation[,2], las=2, col="darkblue")
fviz_contrib(pca, choice = "var", axes = 2)
In this case we obtained that the highest contribution to this Second Principal Component was given by the cholesterol level. The remaining features are almost insignificant. Since rest_blood_pressure, max_heart_rate and cholesterol are positive we understand that having a high cholesterol level did also affect to the other two variables somehow.
We can now plot the PC1 and PC2 to try to obtain some extra information:
biplot(pca)
fviz_pca_var(pca, col.var = "contrib", col.ind = heart$stages)
# fviz_pca_biplot(pca, repel = TRUE)
fviz_pca_biplot(pca, label = "var",
habillage=heart$heart_disease,
addEllipses=TRUE,
ellipse.level=0.95,
ggtheme = theme_minimal(),
legend.title = "Health Disease")
We can see that even if both people with and without heart disease are pretty distributed randomly, we can identify a trend. The points that are in the direction of the negative variables of the first principal component (age, old_peak, rest_blood_pressure) tend to represent the people with the heart disease, while the points in the direction of the max_heart_rate tend to be the scores of the healthy group.
highest_ranked = heart$stage[order(get_pca_ind(pca)$contrib[,1],decreasing=T)][1:10]
# percentage of the highest ranked people that are indeed healthy
(sum(highest_ranked == 0)/length(highest_ranked))*100
## [1] 70
We obtained that 70% of the highest ranked people by the PC does not have a heart disease. This may be related to the high influence of max_heart_rate in the PC1 and in having or not a heart disease.
data.frame(z1=-pca$x[,1],z2=pca$x[,2]) %>%
ggplot(aes(z1,z2,label= heart$stage, color=heart_not_norm$age)) +
labs(title="PCA", x="PC1", y="PC2", color = "Age") +
theme_bw() +
scale_color_gradient(low="#FFE1A8", high="#E26D5C")+
theme(legend.position="bottom") +
geom_text(size=3, hjust=0.6, vjust=0, check_overlap = TRUE)
It is notable that the age values are very related to the PC1. It makes a lot of sense since the higher contribution to that PC was indeed the age.
data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>%
ggplot(aes(z1,z2,label=heart$stage, color=heart_not_norm$chol)) +
labs(title="PCA", x="PC1", y="PC2", color = "Cholesterol") +
theme_bw() +
scale_color_gradient(low="#FFE1A8", high="#E26D5C")+
theme(legend.position="bottom") +
geom_text(size=3, hjust=0.6, vjust=0, check_overlap = TRUE)
As we expected, the first principal component does not change with cholesterol but, on the other hand, the PC2 has very different cholesterol values for the positive than for the negative side.
data.frame(z1=pca$x[,1],z2=heart$max_heart_rate) %>%
ggplot(aes(z1,z2,label=heart$stage,color=heart_not_norm$rest_blood_pressure)) +
labs(title="Performance", x="PC1", y="Max heart rate") +
scale_color_gradient(low="#FFE1A8", high="#E26D5C") +
theme_bw() +
theme(legend.position="bottom") +
geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE)
Let us plot this in a different way in order to obtain a more visual representation of it.
ggplot()+
aes(x = pca$x[,1], y = heart_not_norm$max_heart_rate, color = heart$heart_disease) +
geom_point()+
geom_smooth(color = "cyan")+
xlab("PC1")+
ylab("Max Heart Rate") +
labs(color = "Heart Disease")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
This plot shows how the max heart rate increases with the First Principal component. Moreover, the values where the max heart rate is higher we obtain a greater number of healthy people. Therefore, heart diseases are related with low heart rates.
# Third Component
barplot(pca$rotation[,3], las=2, col="darkblue")
fviz_contrib(pca, choice = "var", axes = 3)
In this case, the Third Principal Component is predominantly influenced by blood pressure and heart rate. The positive values of rest_blood_pressure, max_heart_rate, and oldpeak in the barplot suggest that this PC primarily relates to the individual’s heartbeat and blood circulation.
df_pca = data.frame(PC1 = pca$x[,1],
PC2 = pca$x[,2],
PC3 = pca$x[,3])
ggpairs(df_pca, columns = c("PC1", "PC2", "PC3"),
aes(color = heart$heart_disease),
lower = list(geom_smooth(method = "lm"),
mapping = aes(fill = heart$heart_disease)))
This plot shows us that the PC1 is the main component to separate the people with a heart disease from the people without it. Besides, since the highest contribution to this component was given by the age, we can conclude that it is the most important variable in the apparition of heart diseases.
This data set has a peculiar characteristic, which is the amount of categorical variables in detriment of the numerical ones. However, we can take this to our advantage. We know that categorical variables, that are factors, can then be transformed to numerical. Nevertheless, most of the times categorical variables have a huge impact to the results. So, we thought of using 3 auxiliary data sets for the purpose of this analysis. One will have only the numerical features, whereas the other one will be composed of all the possible variables turned into numerical when needed without our target variable (stage) as that would have a huge weight. Also, to clarify our assumptions that the variable stage is taking all the weight of the model, we will do another one with all the variables with no exception.
This way we could compare results between the 2 data sets and also compare with some individual plots of the categorical variables. As it was done in the Lab 5-6 related to clustering.
glimpse(heart)
## Rows: 854
## Columns: 14
## $ age <dbl> 0.7142857, 0.7959184, 0.7959184, 0.1836735, 0.2653…
## $ sex <fct> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ chest_pain <fct> 1, 0, 0, 3, 2, 2, 0, 0, 0, 0, 0, 2, 3, 2, 3, 3, 2,…
## $ rest_blood_pressure <dbl> 0.6022727, 0.7727273, 0.3181818, 0.4318182, 0.4318…
## $ chol <dbl> 0.4789644, 0.6504854, 0.4660194, 0.5339806, 0.3851…
## $ blood_sugar <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ restecg <fct> 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0,…
## $ max_heart_rate <dbl> 0.6338028, 0.3380282, 0.4859155, 0.8943662, 0.7887…
## $ exang <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRU…
## $ oldpeak <dbl> 0.5568182, 0.4659091, 0.5909091, 0.6931818, 0.4545…
## $ slope <fct> 0, 1, 1, 0, 2, 2, 0, 2, 1, 0, 1, 1, 1, 2, 2, 2, 0,…
## $ stage <fct> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1,…
## $ ageGroups <fct> 4, 4, 4, 1, 2, 3, 4, 3, 4, 3, 3, 3, 3, 2, 3, 3, 2,…
## $ heart_disease <fct> FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALS…
# Let's first delete the created variables used before
heart_fa1 = heart %>% select(-heart_disease, - ageGroups)
# we delete the targe variable
heart_fa1 = heart_fa1 %>% select(-stage)
# Data set with all variables (converting to numerical when needed)
for(i in which(sapply(heart_fa1, class) == "factor")) {
heart_fa1[,i] <- as.numeric(as.character(heart_fa1[,i]))
}
# Then we normalise the variables that were categorical and
# had different values from [0, 1] strictly:
# chest_pain; restcg; slope
heart_fa1$chest_pain = (heart_fa1$chest_pain -
min(heart_fa1$chest_pain)) /
(max(heart_fa1$chest_pain)
- min(heart_fa1$chest_pain))
heart_fa1$restecg = (heart_fa1$restecg -
min(heart_fa1$restecg)) /
(max(heart_fa1$restecg)
- min(heart_fa1$restecg))
heart_fa1$slope = (heart_fa1$slope -
min(heart_fa1$slope)) /
(max(heart_fa1$slope)
- min(heart_fa1$slope))
# Now we have our full numerical data set ready to go
# Data set with just the numerical variables
heart_fa2 = heart[sapply(heart, is.numeric)]
# Factor Analysis for all variables
fa1 = factanal(heart_fa1, factors = 3, rotation="none",
scores = "regression")
fa1
##
## Call:
## factanal(x = heart_fa1, factors = 3, scores = "regression", rotation = "none")
##
## Uniquenesses:
## age sex chest_pain rest_blood_pressure
## 0.442 0.928 0.745 0.862
## chol blood_sugar restecg max_heart_rate
## 0.963 0.906 0.887 0.005
## exang oldpeak slope
## 0.585 0.511 0.644
##
## Loadings:
## Factor1 Factor2 Factor3
## age -0.374 0.418 0.493
## sex 0.185 -0.114 0.158
## chest_pain 0.328 -0.312 0.223
## rest_blood_pressure -0.104 0.283 0.217
## chol 0.140 0.120
## blood_sugar 0.164 0.252
## restecg 0.108 0.243 0.206
## max_heart_rate 0.997
## exang -0.367 0.461 -0.261
## oldpeak -0.162 0.658 -0.172
## slope 0.385 -0.420 0.179
##
## Factor1 Factor2 Factor3
## SS loadings 1.614 1.293 0.615
## Proportion Var 0.147 0.118 0.056
## Cumulative Var 0.147 0.264 0.320
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 71.4 on 25 degrees of freedom.
## The p-value is 2.38e-06
cbind(fa1$loadings, fa1$uniquenesses)
## Factor1 Factor2 Factor3
## age -0.37356846 0.418455561 4.929469e-01 0.4423447
## sex 0.18546801 -0.114440365 1.580143e-01 0.9275381
## chest_pain 0.32761757 -0.312397003 2.232699e-01 0.7452228
## rest_blood_pressure -0.10374497 0.282533676 2.166863e-01 0.8624582
## chol -0.04857463 0.140387548 1.202523e-01 0.9634767
## blood_sugar -0.06304669 0.164057516 2.516583e-01 0.9057794
## restecg 0.10819513 0.243109549 2.058704e-01 0.8868075
## max_heart_rate 0.99747615 0.006440379 -7.067179e-05 0.0050000
## exang -0.36698962 0.460593452 -2.612099e-01 0.5849391
## oldpeak -0.16198144 0.658000092 -1.719729e-01 0.5112262
## slope 0.38459234 -0.420000816 1.787718e-01 0.6437300
# Factor Analysis for just the numerical variables
fa2 = factanal(heart_fa2, factors = 2, rotation="none",
scores = "regression")
fa2
##
## Call:
## factanal(x = heart_fa2, factors = 2, scores = "regression", rotation = "none")
##
## Uniquenesses:
## age rest_blood_pressure chol max_heart_rate
## 0.569 0.817 0.956 0.005
## oldpeak
## 0.848
##
## Loadings:
## Factor1 Factor2
## age -0.372 0.541
## rest_blood_pressure -0.103 0.415
## chol 0.205
## max_heart_rate 0.997
## oldpeak -0.159 0.357
##
## Factor1 Factor2
## SS loadings 1.172 0.634
## Proportion Var 0.234 0.127
## Cumulative Var 0.234 0.361
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 0.63 on 1 degree of freedom.
## The p-value is 0.429
cbind(fa2$loadings, fa2$uniquenesses)
## Factor1 Factor2
## age -0.37222189 0.541128425 0.5686308
## rest_blood_pressure -0.10294167 0.414851100 0.8173024
## chol -0.04819274 0.205258458 0.9555460
## max_heart_rate 0.99749392 0.002424049 0.0050000
## oldpeak -0.15857495 0.356862345 0.8475034
As a first approach, we get a strong difference between using all the variables (except the stage) and using just the strictly numerical.
When we are using only numerical we get poor model with a p-value around 0.7 and see that we will not get strong relationships and a strong explanation of the variability of the data set. As we have seen before in the PCA.
At the time of studying all variables as numerical, except the target variable, we get a strong model, in which all the SS loadings of the factors are above 0.6, which is actually a quite good result.
However, these results are just a first glance. As we are using the function factanal() we need to know what we are doing and how our parameters may affect our study.
factors -> number of factors to extract, this specifies the dimensionality of the factor solution.
scores -> factor scores are numeric values that represent how much each observed variable contributes to each of the extracted factors. This parameter can take 3 values:
none -> no factor scores will be computed
“regression” -> it will compute the factor scores with the Thompson’s Regression method, which focus on calculating uncorrelated scores and having a minimum sum of squared differences.
“Bartlett” -> it will compute the factor scores with the Bartlett’s weighted least-squares method, which focus on maximising the variance explained by the factors.
rotation -> there are several rotation methods with several effects such as the assumption of relationship between variables. There are several:
In order to see better and clearer the results we will divide from now on between the factor analysis corresponding to each auxiliary data set.
We will start the factor Analysis with the data set heart_fa1 which is composed of all the variables turned into numerical when needed, excluding the feature stage.(Every feature was normalised).
# In short, we have
# heart_fa1 # all(almost)
# heart_fa2 # numerical
# To compare the different results we will follow...
# fa1_X -> factor analisys for the 1st data set
# fa2_X -> factor analisys for the 2nd data set
# Let's start with all the the variables data set
# Changing the scores
fa1_1 = factanal(heart_fa1, factors = 3, rotation= "none",
scores = "none")
fa1_1
##
## Call:
## factanal(x = heart_fa1, factors = 3, scores = "none", rotation = "none")
##
## Uniquenesses:
## age sex chest_pain rest_blood_pressure
## 0.442 0.928 0.745 0.862
## chol blood_sugar restecg max_heart_rate
## 0.963 0.906 0.887 0.005
## exang oldpeak slope
## 0.585 0.511 0.644
##
## Loadings:
## Factor1 Factor2 Factor3
## age -0.374 0.418 0.493
## sex 0.185 -0.114 0.158
## chest_pain 0.328 -0.312 0.223
## rest_blood_pressure -0.104 0.283 0.217
## chol 0.140 0.120
## blood_sugar 0.164 0.252
## restecg 0.108 0.243 0.206
## max_heart_rate 0.997
## exang -0.367 0.461 -0.261
## oldpeak -0.162 0.658 -0.172
## slope 0.385 -0.420 0.179
##
## Factor1 Factor2 Factor3
## SS loadings 1.614 1.293 0.615
## Proportion Var 0.147 0.118 0.056
## Cumulative Var 0.147 0.264 0.320
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 71.4 on 25 degrees of freedom.
## The p-value is 2.38e-06
# "regression" -> SS loadings 1.645 1.248 0.608
# "Bartlett" -> SS loadings 1.645 1.248 0.608
# "none" -> SS loadings 1.645 1.248 0.608
# We see that the scores parameter in this case does not affect
# the result
# Now we will see how the value of rotation affects our results
fa1_2 = factanal(heart_fa1, factors = 3, rotation= "promax",
scores = "none")
fa1_2
##
## Call:
## factanal(x = heart_fa1, factors = 3, scores = "none", rotation = "promax")
##
## Uniquenesses:
## age sex chest_pain rest_blood_pressure
## 0.442 0.928 0.745 0.862
## chol blood_sugar restecg max_heart_rate
## 0.963 0.906 0.887 0.005
## exang oldpeak slope
## 0.585 0.511 0.644
##
## Loadings:
## Factor1 Factor2 Factor3
## age -0.183 0.708
## sex -0.242
## chest_pain -0.477 0.111
## rest_blood_pressure 0.333
## chol 0.178
## blood_sugar 0.322
## restecg 0.201 0.271
## max_heart_rate -0.129 0.918 -0.179
## exang 0.635
## oldpeak 0.704 0.214
## slope -0.540 0.113
##
## Factor1 Factor2 Factor3
## SS loadings 1.505 1.001 0.874
## Proportion Var 0.137 0.091 0.079
## Cumulative Var 0.137 0.228 0.307
##
## Factor Correlations:
## Factor1 Factor2 Factor3
## Factor1 1.0000 0.320 -0.0347
## Factor2 0.3201 1.000 -0.3764
## Factor3 -0.0347 -0.376 1.0000
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 71.4 on 25 degrees of freedom.
## The p-value is 2.38e-06
# None -> SS loadings 1.645 1.248 0.608 -> 3.501
# Varimax -> SS loadings 1.599 0.975 0.926 -> 3.500
# promax -> SS loadings 1.469 0.979 0.899 -> 3.347
# We see that there is practically no difference between using
# varimax or no rotation. The relevant conclusion is to not use
# promax (in this case)
# Then, we will see which number of factors is the best
nFactors::plotnScree(nFactors::nScree(x = eigen(cor(heart_fa1))$values),legend = F, main = "Optimal Amount of Factors (Elbow Method)", xlab = "Number of Factors")
abline(h = 0.90, col = "purple", lty = 6)
# By the Elbow Method we see the optimal amount of factors is between 4 and 5
# Let's compare
fa1_3 = factanal(heart_fa1, factors = 4, rotation= "none",
scores = "none")
fa1_3
##
## Call:
## factanal(x = heart_fa1, factors = 4, scores = "none", rotation = "none")
##
## Uniquenesses:
## age sex chest_pain rest_blood_pressure
## 0.447 0.923 0.698 0.852
## chol blood_sugar restecg max_heart_rate
## 0.929 0.906 0.885 0.005
## exang oldpeak slope
## 0.479 0.566 0.396
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4
## age -0.374 0.336 0.548
## sex 0.186 -0.125 0.126 -0.103
## chest_pain 0.328 -0.326 0.151 -0.256
## rest_blood_pressure -0.104 0.237 0.273
## chol 0.109 0.162 0.176
## blood_sugar 0.130 0.269
## restecg 0.108 0.212 0.238
## max_heart_rate 0.997
## exang -0.368 0.506 -0.184 0.308
## oldpeak -0.163 0.637
## slope 0.386 -0.547 0.193 0.343
##
## Factor1 Factor2 Factor3 Factor4
## SS loadings 1.617 1.326 0.642 0.329
## Proportion Var 0.147 0.121 0.058 0.030
## Cumulative Var 0.147 0.268 0.326 0.356
##
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 30.9 on 17 degrees of freedom.
## The p-value is 0.0205
fa1_4 = factanal(heart_fa1, factors = 5, rotation= "none",
scores = "none")
fa1_4
##
## Call:
## factanal(x = heart_fa1, factors = 5, scores = "none", rotation = "none")
##
## Uniquenesses:
## age sex chest_pain rest_blood_pressure
## 0.459 0.319 0.698 0.849
## chol blood_sugar restecg max_heart_rate
## 0.893 0.885 0.884 0.005
## exang oldpeak slope
## 0.492 0.545 0.454
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4 Factor5
## age -0.374 0.249 0.331 0.479
## sex 0.187 -0.473 0.635 -0.135
## chest_pain 0.328 -0.348 0.140 -0.230
## rest_blood_pressure -0.104 0.190 0.201 0.238
## chol 0.189 0.101 0.238
## blood_sugar 0.129 0.295
## restecg 0.108 0.167 0.176 0.208
## max_heart_rate 0.997
## exang -0.368 0.494 0.125 -0.198 0.271
## oldpeak -0.163 0.578 0.285 -0.107
## slope 0.386 -0.447 -0.223 0.221 0.313
##
## Factor1 Factor2 Factor3 Factor4 Factor5
## SS loadings 1.617 1.268 0.772 0.564 0.295
## Proportion Var 0.147 0.115 0.070 0.051 0.027
## Cumulative Var 0.147 0.262 0.332 0.384 0.411
##
## Test of the hypothesis that 5 factors are sufficient.
## The chi square statistic is 11.75 on 10 degrees of freedom.
## The p-value is 0.302
# 4 Factors -> SS loadings 1.612 1.289 0.638 0.601
# 5 Factors -> SS loadings 1.544 1.486 0.676 0.573 0.254
# We can discard the last factors for fa1_4
# The best option is 4 Factors
fa1_best = fa1_3
fa1_best
##
## Call:
## factanal(x = heart_fa1, factors = 4, scores = "none", rotation = "none")
##
## Uniquenesses:
## age sex chest_pain rest_blood_pressure
## 0.447 0.923 0.698 0.852
## chol blood_sugar restecg max_heart_rate
## 0.929 0.906 0.885 0.005
## exang oldpeak slope
## 0.479 0.566 0.396
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4
## age -0.374 0.336 0.548
## sex 0.186 -0.125 0.126 -0.103
## chest_pain 0.328 -0.326 0.151 -0.256
## rest_blood_pressure -0.104 0.237 0.273
## chol 0.109 0.162 0.176
## blood_sugar 0.130 0.269
## restecg 0.108 0.212 0.238
## max_heart_rate 0.997
## exang -0.368 0.506 -0.184 0.308
## oldpeak -0.163 0.637
## slope 0.386 -0.547 0.193 0.343
##
## Factor1 Factor2 Factor3 Factor4
## SS loadings 1.617 1.326 0.642 0.329
## Proportion Var 0.147 0.121 0.058 0.030
## Cumulative Var 0.147 0.268 0.326 0.356
##
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 30.9 on 17 degrees of freedom.
## The p-value is 0.0205
After getting the best model, 4-factor with neither scores nor rotation, we will interpret the results.
# Auxiliary data frame to store all the the loadings per factor
factors1 = data.frame(Variable = rownames(fa1_best$loadings),
Factor1 = fa1_best$loadings[, 1],
Factor2 = fa1_best$loadings[, 2],
Factor3 = fa1_best$loadings[, 3],
Factor4 = fa1_best$loadings[, 4])
# Now having a column that indicates the factor to which the
# loading belongs to
loadings_fa1 = tidyr::gather(factors1, Factor,
Loading, -Variable)
# Final barplot with all the Factors and their loadings
gFactors1 = ggplot(data = loadings_fa1, aes(x = Variable,
y = Loading,
fill = Factor)) +
geom_bar(stat = "identity") +
labs(title = "Factor Analysis Loadings",
subtitle = "All variables except stage",
x = "Variable",
y = "Loadings") + theme_minimal() +
theme(axis.text.x = element_text(angle = 45,
hjust = 1)) +
facet_wrap(~Factor, scales = "free")
gFactors1
From this first Factor Analysis procedure we see 4 factors, that show us 4 different tendencies of patients. All of the factors are not defined by the cholesterol, it is a variable that does not give us much information about the patient.
Factor 1 -> shows us patients that its heart behavior can be explained mainly by the slope.
The age and the max_heart_rate are negatively correlated, as we expected before. But giving more importance to the max_heart_rate, this observation is quite relevant as in the data set we had much less observations of young people than older people. In addition to the low influence of the sex, it has influence but just a small percentage. Therefore, it is fair to assume that this factor is explaining younger patients.
With exang and oldpeak we have a positive correlation between them (negative in overall), which makes sense: if you do not show discomfort while exercising (exang) then the normal thing would be to not show any kind of heart depression (oldpeak).
From these statements we can deduce that this factor is focused on the tendency of healthy (or usual) young patients, as their correlations follow what a healthy (or usual for what a younger) person would get, therefore the weight goes to the slope in which the original value 1 meant no main problems, whereas 0 and 2 meant heart consequences.
Factor 2 -> shows directly that the higher the age the lower the max_heart_rate, most of the weight of these patients stays between that duality. But sharing importance with the feature exang.
Another duality seen, but with a positive relationship is chest_pain and exang. In this case the relationships shows more weight to exang, this relationship is coherent as if an individual experienced some discomfort while exercising exang would be True and the chest_pain would be in between 0, 1 or 2 but no 3. The features oldpeak, rest_blood_pressure and sex follow adequate results.
Due to the positive relevance of the age, other features related to the appearance of heart issues and the dualities mentioned, we can conclude that this factor explains older patients in moderate heart risks.
Factor 3 -> the main thing to mention is an overall positively correlation between all the features (excluding the ones with no relevance). In this factor it can be deduced simply that it shows the patients that have a tendency of having big max_heart_rate, oldpeak, rest_blood_pressure, which are usually older patients but the age does not have an outstanding importance. Due to this high importance on values that when positive affect badly the heart’s health of the patient we deduce that this factor describes the tendency of patients with high heart risks.
Factor 4 -> in this factor we see the usual behaviour between age and max_heart_rate. However, here the max_heart_rate is not decreasing as strong as it does the age, which is a good indicator of heart’s health.
We see lots of indicators of a healthy heart, not experiencing discomfort while exercising while increasing age (exang), almost no effect of increasing age on oldpeak. Also the rest of the features have adequate values. Therefore we can conclude that this factor is explaining the tendency of overall healthy patients.
The next Factor Analysis procedure will be focused on the auxiliary data set heart_fa2, which is composed with the strictly numerical variables (just 5).
# Our data set to work is heart_fa2
fa2 = factanal(heart_fa2, factors = 2, rotation = "promax",
scores = "none")
fa2
##
## Call:
## factanal(x = heart_fa2, factors = 2, scores = "none", rotation = "promax")
##
## Uniquenesses:
## age rest_blood_pressure chol max_heart_rate
## 0.569 0.817 0.956 0.005
## oldpeak
## 0.848
##
## Loadings:
## Factor1 Factor2
## age -0.107 0.610
## rest_blood_pressure 0.453
## chol 0.224
## max_heart_rate 0.966
## oldpeak 0.395
##
## Factor1 Factor2
## SS loadings 0.956 0.790
## Proportion Var 0.191 0.158
## Cumulative Var 0.191 0.349
##
## Factor Correlations:
## Factor1 Factor2
## Factor1 1.000 -0.366
## Factor2 -0.366 1.000
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 0.63 on 1 degree of freedom.
## The p-value is 0.429
# Let's see the optimal number of factors
nFactors::plotnScree(nFactors::nScree(x = eigen(cor(heart_fa2))$values),legend = F, main = "Optimal Amount of Factors (Elbow Method)", xlab = "Number of Factors")
# 2 factors
# Then study scores and rotation
# scores
# regression -> SS loadings 1.118 0.327
# Bartlett -> SS loadings 1.118 0.327
# none -> SS loadings 1.118 0.327
# Rotation
# none -> SS loadings 1.118 0.327 -> 1.445
# varimax -> SS loadings 0.866 0.578 -> 1.444
# promax -> SS loadings 0.936 0.494 -> 1.43
# Between varimax and none we will use the one that helps us more to visualise
fa2_best = factanal(heart_fa2, factors = 2,
rotation = "varimax", scores = "none")
We will study graphically first and analyse the results.
# Plotting (same as before but with less factors)
# Auxiliary data frame to store all the the loadings per factor
factors2 = data.frame(Variable = rownames(fa2_best$loadings),
Factor1 = fa2_best$loadings[, 1],
Factor2 = fa2_best$loadings[, 2])
# Now having a column that indicates the factor to which the
# loading belongs to
loadings_fa2 = tidyr::gather(factors2, Factor,
Loading, -Variable)
# Final barplot with all the Factors and their loadings
gFactors2 = ggplot(data = loadings_fa2, aes(x = Variable,
y = Loading,
fill = Factor)) +
geom_bar(stat = "identity") +
labs(title = "Factor Analysis Loadings",
subtitle = "Just Numerical Variables",
x = "Variable",
y = "Loadings") + theme_minimal() +
theme(axis.text.x = element_text(angle = 45,
hjust = 1)) +
facet_wrap(~Factor)
gFactors2
From this second Factor Analysis procedure we see 2 factors, that show us 2 different tendencies of patients.
Factor 1 -> we see that the higher the age the lower the max_heart_rate, which is a coherent relation. We also see positive relationship with oldpeak and rest_blood_pressure, which is an adequate result knowing that when you get older the usual thing is to be more prone to those kind of problems. We conclude that this factor explains the usual heart’s health risk of a patient.
Factor 2 -> explains the usual relationship between age and max_heart_rate, but very attenuated. In contrast to the importance of rest_blood_pressure. This results makes us think that this factor is focused more on unusual health risk on healthy patients. However this factor is not as robust as the one before, as its SS loadings are 0.578.
Applying what we have learned before, we will now create a model but for all the variables including the stage.
heart_fa3 <- heart %>% select(-heart_disease, - ageGroups)
for(i in which(sapply(heart_fa3, class) == "factor")) {
heart_fa3[,i] <- as.numeric(as.character(heart_fa3[,i]))
}
glimpse(heart_fa3)
## Rows: 854
## Columns: 12
## $ age <dbl> 0.7142857, 0.7959184, 0.7959184, 0.1836735, 0.2653…
## $ sex <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ chest_pain <dbl> 1, 0, 0, 3, 2, 2, 0, 0, 0, 0, 0, 2, 3, 2, 3, 3, 2,…
## $ rest_blood_pressure <dbl> 0.6022727, 0.7727273, 0.3181818, 0.4318182, 0.4318…
## $ chol <dbl> 0.4789644, 0.6504854, 0.4660194, 0.5339806, 0.3851…
## $ blood_sugar <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ restecg <dbl> 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0,…
## $ max_heart_rate <dbl> 0.6338028, 0.3380282, 0.4859155, 0.8943662, 0.7887…
## $ exang <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRU…
## $ oldpeak <dbl> 0.5568182, 0.4659091, 0.5909091, 0.6931818, 0.4545…
## $ slope <dbl> 0, 1, 1, 0, 2, 2, 0, 2, 1, 0, 1, 1, 1, 2, 2, 2, 0,…
## $ stage <dbl> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1,…
# Then we normalise the variables that were categorical and
# had different values from [0, 1] strictly:
# chest_pain; restcg; slope and stage
heart_fa3$chest_pain = (heart_fa3$chest_pain -
min(heart_fa3$chest_pain)) /
(max(heart_fa3$chest_pain)
- min(heart_fa3$chest_pain))
heart_fa3$restecg = (heart_fa3$restecg -
min(heart_fa3$restecg)) /
(max(heart_fa3$restecg)
- min(heart_fa3$restecg))
heart_fa3$slope = (heart_fa3$slope -
min(heart_fa3$slope)) /
(max(heart_fa3$slope)
- min(heart_fa3$slope))
heart_fa3$stage = (heart_fa3$stage -
min(heart_fa3$stage)) /
(max(heart_fa3$stage)
- min(heart_fa3$stage))
Now we have our full numerical data set ready to go.
# The next step is build the proper Factor model
fa3_1 = factanal(heart_fa3, factors = 3, rotation = "none",
scores = "Bartlett")
fa3_1
##
## Call:
## factanal(x = heart_fa3, factors = 3, scores = "Bartlett", rotation = "none")
##
## Uniquenesses:
## age sex chest_pain rest_blood_pressure
## 0.416 0.903 0.703 0.875
## chol blood_sugar restecg max_heart_rate
## 0.966 0.908 0.884 0.005
## exang oldpeak slope stage
## 0.602 0.537 0.674 0.515
##
## Loadings:
## Factor1 Factor2 Factor3
## age -0.374 0.415 0.521
## sex 0.186 -0.171 0.181
## chest_pain 0.329 -0.362 0.240
## rest_blood_pressure -0.104 0.249 0.227
## chol 0.130 0.123
## blood_sugar 0.171 0.242
## restecg 0.108 0.251 0.204
## max_heart_rate 0.997
## exang -0.368 0.453 -0.239
## oldpeak -0.163 0.646 -0.141
## slope 0.385 -0.395 0.147
## stage -0.380 0.575
##
## Factor1 Factor2 Factor3
## SS loadings 1.761 1.613 0.637
## Proportion Var 0.147 0.134 0.053
## Cumulative Var 0.147 0.281 0.334
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 113.94 on 33 degrees of freedom.
## The p-value is 7.76e-11
# We make the same comparison as before
# First with scores (does not matter)
# none -> SS loadings 1.792 1.564 0.622
# regression -> SS loadings 1.792 1.564 0.622
# Bartlett -> SS loadings 1.792 1.564 0.622
# Then with the rotation (between none and varimax)
# none -> SS loadings 1.792 1.564 0.622 -> 3.978
# varimax -> SS loadings 2.001 0.992 0.984 -> 3.977
# promax -> SS loadings 1.902 0.990 0.888 -> 3.78
# Now number of factors
nFactors::plotnScree(nFactors::nScree(x = eigen(cor(heart_fa3))$values),legend = F, main = "Optimal Amount of Factors (Elbow Method)", xlab = "Number of Factors")
# 4 is the ideal number of factors
fa3_2 = factanal(heart_fa3, factors = 4, rotation= "varimax",
scores = "none")
fa3_2
##
## Call:
## factanal(x = heart_fa3, factors = 4, scores = "none", rotation = "varimax")
##
## Uniquenesses:
## age sex chest_pain rest_blood_pressure
## 0.428 0.847 0.645 0.872
## chol blood_sugar restecg max_heart_rate
## 0.964 0.896 0.880 0.025
## exang oldpeak slope stage
## 0.630 0.316 0.650 0.455
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4
## age 0.156 -0.261 0.687
## sex 0.386
## chest_pain 0.554 -0.179 0.107
## rest_blood_pressure 0.143 0.325
## chol 0.185
## blood_sugar 0.319
## restecg 0.170 0.298
## max_heart_rate 0.352 -0.143 0.904 -0.118
## exang -0.453 0.363 -0.156
## oldpeak -0.240 0.766 0.188
## slope 0.259 -0.467 0.234
## stage -0.585 0.318 0.303
##
## Factor1 Factor2 Factor3 Factor4
## SS loadings 1.264 1.140 1.022 0.966
## Proportion Var 0.105 0.095 0.085 0.080
## Cumulative Var 0.105 0.200 0.285 0.366
##
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 59.28 on 24 degrees of freedom.
## The p-value is 8.06e-05
# Results
# varimax -> SS loadings 1.578 1.110 1.034 0.935
# 4.657
# none -> SS loadings 2.376 0.890 0.771 0.620
# 4.657
# Varimax is basically minimising the variance between factors
# Let's study with varimax because it will make our study easier
# to see
fa3_best = factanal(heart_fa3, factors = 4, rotation= "varimax",
scores = "none")
fa3_best
##
## Call:
## factanal(x = heart_fa3, factors = 4, scores = "none", rotation = "varimax")
##
## Uniquenesses:
## age sex chest_pain rest_blood_pressure
## 0.428 0.847 0.645 0.872
## chol blood_sugar restecg max_heart_rate
## 0.964 0.896 0.880 0.025
## exang oldpeak slope stage
## 0.630 0.316 0.650 0.455
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4
## age 0.156 -0.261 0.687
## sex 0.386
## chest_pain 0.554 -0.179 0.107
## rest_blood_pressure 0.143 0.325
## chol 0.185
## blood_sugar 0.319
## restecg 0.170 0.298
## max_heart_rate 0.352 -0.143 0.904 -0.118
## exang -0.453 0.363 -0.156
## oldpeak -0.240 0.766 0.188
## slope 0.259 -0.467 0.234
## stage -0.585 0.318 0.303
##
## Factor1 Factor2 Factor3 Factor4
## SS loadings 1.264 1.140 1.022 0.966
## Proportion Var 0.105 0.095 0.085 0.080
## Cumulative Var 0.105 0.200 0.285 0.366
##
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 59.28 on 24 degrees of freedom.
## The p-value is 8.06e-05
The next step is to visualise and get conclusions.
# Same procedure as before
# Auxiliary data frame to store all the the loadings per factor
factors3 = data.frame(Variable = rownames(fa3_best$loadings),
Factor1 = fa3_best$loadings[, 1],
Factor2 = fa3_best$loadings[, 2],
Factor3 = fa3_best$loadings[, 3],
Factor4 = fa3_best$loadings[, 4])
# Now having a column that indicates the factor to which the
# loading belongs to
loadings_fa3 = tidyr::gather(factors3, Factor,
Loading, -Variable)
# Final barplot with all the Factors and their loadings
gFactors3 = ggplot(data = loadings_fa3, aes(x = Variable,
y = Loading,
fill = Factor)) +
geom_bar(stat = "identity") +
labs(title = "Factor Analysis Loadings",
subtitle = "All variables",
x = "Variable",
y = "Loadings") + theme_minimal() +
theme(axis.text.x = element_text(angle = 45,
hjust = 1)) +
facet_wrap(~Factor, scales = "free")
gFactors3
This Factor Analysis procedure confirmed our assumptions that the variable stage would have a lot of weight and make the model less flexible. We see a clear distinctions of the different stages between the factors, being the factors with higher stage importance the unhealthier tendencies and the ones with the lower stage are the healthier tendencies.
In short, thanks to the different Factor Analysis we studied before, we conclude that the stronger model was using all the features, except the stage, making numerical the ones needed.
The FA showed us that there are 4 main tendencies healthy (or usual) young patients, older patients in moderate heart risks, patients with high heart risks and overall healthy patients. These factors can explain in the most precise way the different tendencies of the patients without loosing flexibility while being accurate enough.
First we make an initial guess of 3 clusters (healthy, disease and confusing/low disease stage)
set.seed(321)
heart_clust = scale(heart_pca) # Since it has all the numeric variables
k = 3
fit_clust = kmeans(heart_clust, centers=k, nstart=1000)
groups = fit_clust$cluster
par(mfrow=c(1,1))
barplot(table(groups), col="blue")
We obtained 3 groups where the largest number of observations show up in the second cluster.
Interpretation of the centers
centers=fit_clust$centers
barplot(centers[1,], las=2, col="darkblue") # High negative max_heart_rate, negative all except age
barplot(centers[2,], las=2, col="darkblue") # High all positive except max_heart_rate (negative)
barplot(centers[3,], las=2, col="darkblue") # High positive max_heart_rate, high negative age
We understand that the clusters obtained are:
the first one for normal people (values near 0 except for heart rate).
the second for the people with the disease (high levels of age, chol, oldpeak).
the third one for healthy people (young and with a good positive heart rate).
Clustplot
# clusplot
fviz_cluster(fit_clust, data = heart_clust, geom = c("point"), ellipse.type = 'norm') +
theme_minimal()+
geom_text(label=heart$stage,hjust=0, vjust=0,size=3,check_overlap = F)+
scale_fill_brewer(palette="Paired")
We can see that the majority of points that belong to the cluster 3 are from the stage 0 (healthy), however, the other 2 groups are pretty messed up. We can try to reduce to 2 clusters:
k = 2
fit_clust = kmeans(heart_clust, centers=k, nstart=1000)
groups = fit_clust$cluster
barplot(table(groups), col="blue")
barplot(centers[1,], las=2, col="darkblue") # Low positive age, remaining low negative except a very negative max_heart_rate
barplot(centers[2,], las=2, col="darkblue") # All features are high except heart_rate
A priori by looking to the centers information, we can assume that the first cluster is related to young people, maybe more healthy aspects for the other variables, and the second cluster for the older people, with higher risk of having a heart disease.
# clusplot
fviz_cluster(fit_clust, data = heart_clust, geom = c("point"), ellipse.type = 'norm') +
theme_minimal()+
geom_text(label=heart$stage,hjust=0, vjust=0,size=3,check_overlap = T)+
scale_fill_brewer(palette="Paired")
We obtained a cluster with many healthy people whereas the other one has all kind of people. We assume that this happens because the center of the “healthy” cluster has a low age, which is strongly correlated with not having the heart disease.
ggplot()+
aes(x = heart_not_norm$stage, y = heart_not_norm$age)+
geom_violin(fill = "lightyellow")+
geom_jitter(aes(color = as.factor(fit_clust$cluster)))+
theme_minimal() +
ggtitle("Heart Disease or not by age in each cluster") +
ylab("Age") +
xlab("Stage of the disease") +
labs(color = "Cluster number")
As we go up in the stage of the disease, we can see that less observations belong to the first cluster. We can also observe that the second cluster is mainly composed by older people (the ones who tend to have the disease).
Silhouette Plot
d <- dist(heart_clust, method="euclidean")
sil = silhouette(groups, d)
plot(sil, col=1:5, main="", border=NA)
summary(sil)
## Silhouette of 854 units in 2 clusters from silhouette.default(x = groups, dist = d) :
## Cluster sizes and average silhouette widths:
## 447 407
## 0.2545616 0.1720360
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.06103 0.12366 0.21319 0.21523 0.31111 0.45441
# the same with factoextra
fviz_silhouette(eclust(heart_clust, "kmeans", stand=TRUE, k=2))
## cluster size ave.sil.width
## 1 1 407 0.17
## 2 2 447 0.25
We can see that the clusters silhouette width with two clusters can be considered well-clustered. This is because there are very few observations near 0 and even less with a negative value (which means that they might be in the wrong cluster). Therefore, we assume that 2 clusters is the ideal amount. However, we can prove it
fviz_nbclust(heart_clust, kmeans, method = 'silhouette', k.max = 15, nstart = 500)
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
fviz_nbclust(heart_clust, kmeans, method = 'wss', k.max = 15, nstart = 500) +
geom_vline(xintercept = 6, linetype = 2) #elbow location
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
fviz_nbclust(heart_clust, kmeans, method = 'gap_stat', k.max = 10, nstart = 100, nboot = 500)
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
We can see that for the silhouette and the gap_stat methods, the best number of clusters is 2. However, the wss method proposes a higher number of clusters, around 6, so lets study this option.
k = 6
fit_clust = kmeans(heart_clust, centers=k, nstart=1000)
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
groups = fit_clust$cluster
barplot(table(groups), col="blue")
centers=fit_clust$centers
barplot(centers[1,], las=2, col="darkblue")
barplot(centers[2,], las=2, col="darkblue")
barplot(centers[3,], las=2, col="darkblue")
barplot(centers[4,], las=2, col="darkblue")
barplot(centers[5,], las=2, col="darkblue")
barplot(centers[6,], las=2, col="darkblue")
# clusplot
fviz_cluster(fit_clust, data = heart_clust, geom = c("point"), ellipse.type = 'norm') +
theme_minimal()+
geom_text(label=heart$stage,hjust=0, vjust=0,size=3,check_overlap = T)+
scale_fill_brewer(palette="Paired")
# the same with factoextra
fviz_silhouette(eclust(heart_clust, "kmeans", stand=TRUE, k=6))
## cluster size ave.sil.width
## 1 1 124 0.17
## 2 2 178 0.21
## 3 3 134 0.14
## 4 4 174 0.19
## 5 5 106 0.14
## 6 6 138 0.16
We can see that several clusters have negative siluette width and, therefore, it may not be the best number of clusters.
Partitioning (clustering) of the data into k clusters *around medoids*
More robust version than k-means, the centers are now patients
fit.pam <- eclust(heart_clust, "pam", stand=TRUE, k=2, graph=F)
fviz_cluster(fit.pam, data = heart_clust, geom = c("point"), pointsize=0.5) +
theme_minimal() +
geom_text(label= heart$stage, hjust=0, vjust=0, size=4, check_overlap = T) +
scale_fill_brewer(palette="Paired")
We can see that the cluster plot is very simmilar to the previous one, where cluster 2 has mostly stage 0 observations (healthy people).
Number of groups by pam:
fviz_nbclust(heart_clust, pam, method = 'silhouette', k.max = 10)
fviz_nbclust(heart_clust, pam, method = 'gap_stat', k.max = 10, nboot = 500)
By both methods we have that 2 clusters would be the best.
# The closer to 1 the more agreement
fit_clust = kmeans(heart_clust, centers=2, nstart=1000)
fit.pam <- eclust(heart_clust, "pam", stand=TRUE, k=2, graph=F)
adjustedRandIndex(fit_clust$cluster, fit.pam$clustering)
## [1] 0.6040579
There is some kind of agreement between them, however it may seem low comparing it to what we expected.
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:mice':
##
## convergence
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
fit.ker <- kkmeans(as.matrix(heart_clust), centers=2, kernel="rbfdot") # Radial Basis kernel (Gaussian)
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
# By default, Gaussian kernel is used
# By default, sigma parameter is estimated
centers(fit.ker)
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.6115822 -0.4361056 -0.2499453 0.4850456 -0.5372957
## [2,] 0.5974252 0.4260105 0.2441596 -0.4738177 0.5248583
size(fit.ker)
## [1] 422 432
withinss(fit.ker)
## [1] 2007.595 2580.554
object.ker = list(data = heart_clust, cluster = fit.ker@.Data)
fviz_cluster(object.ker, geom = c("point"), ellipse=F,pointsize=1)+
theme_minimal()+
geom_text(label= heart$stage,hjust=0, vjust=0,size=3,check_overlap = T)+
scale_fill_brewer(palette="Paired")
adjustedRandIndex(fit_clust$cluster, fit.ker)
## [1] 0.8340121
adjustedRandIndex(fit.pam$cluster, fit.ker)
## [1] 0.6298727
We obtained a value of 0.74 for both the relationship with the basic clustering and with PAM. Therefore, we can say that the results obtained are very similar to eachother.
# We decide the distance and linkage
d = dist(scale(heart_clust), method = "euclidean")
hc <- hclust(d, method = "ward.D2")
hc$labels <- heart$heart_disease
fviz_dend(x = hc,
k=2,
palette = "jco",
rect = TRUE, rect_fill = TRUE, cex=0.5,
rect_border = "jco"
)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
We cannot obtain much information from this plot (We cannot read the labels).
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:lubridate':
##
## %--%, union
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tidyr':
##
## crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
fviz_dend(x = hc,
k = 2,
color_labels_by_k = TRUE,
cex = 0.8,
type = "phylogenic",
repel = TRUE)+ labs(title="Socio-economic-health tree clustering of the world") + theme(axis.text.x=element_blank(),axis.text.y=element_blank())
Not a very useful plot.
The last clustering technique will be the Expectation-Maximization clustering, which is like k-means but it computes the probabilities that an observation has of belonging to a certain cluster based on probability distributions.
# We will follow the same strategy using the numerical variables
heart_clust_EM = Mclust(scale(heart_clust))
summary(heart_clust_EM)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 7 components:
##
## log-likelihood n df BIC ICL
## -5309.57 854 122 -11442.63 -11767.01
##
## Clustering table:
## 1 2 3 4 5 6 7
## 162 107 122 109 101 162 91
# The best in this case is using 8 clusters
head(heart_clust_EM$z)
## [,1] [,2] [,3] [,4] [,5]
## 1 5.860221e-01 7.785648e-75 4.139582e-01 1.692244e-106 1.970352e-05
## 2 5.598000e-06 2.667645e-32 3.114209e-03 4.242570e-53 9.968749e-01
## 3 3.494458e-03 5.074990e-96 9.965055e-01 9.981290e-134 2.663456e-08
## 4 1.475420e-03 1.639270e-179 8.589986e-01 2.119396e-238 1.183870e-26
## 5 2.126657e-06 7.205824e-33 5.992572e-01 2.615048e-37 2.937967e-01
## 6 6.448621e-01 2.250531e-13 2.412705e-05 4.209271e-14 3.551137e-01
## [,6] [,7]
## 1 4.618307e-74 2.660980e-12
## 2 8.065858e-36 5.275209e-06
## 3 1.431838e-95 9.055168e-15
## 4 3.610659e-175 1.395260e-01
## 5 2.299156e-27 1.069440e-01
## 6 9.674786e-09 4.242689e-24
head(heart_clust_EM$classification)
## 1 2 3 4 5 6
## 1 5 3 3 3 1
fviz_mclust(object = heart_clust_EM, what = "BIC",
pallet = "jco")
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `gather()` instead.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Now we get 8 clusters, however, it seems that due to the quantity between some close clusters it will tend in the end to show between 4 and 2 real clusters as we have studied before.
Now, talking more in deep about the values of BIC and ICL we got good results as the lower those values the better.
We will plot our results and see if our assumptions are right.
fviz_mclust(object = heart_clust_EM, what = "classification",
geom = "point", pallet = "jco")
Our assumptions was right as we see there is some sort of overlapping between observations along the center of the plot, the right, top left and bottom left.
Therefore if we compare the agreement to the previous clusters, we will not get a number closer to 1 due to the huge difference of number of clusters. (We will not plot a heatmap as we do not have a clear way to analyse results due to the labels).
# Computes the adjusted Rand index comparing two classification
# The closer to 1 the more agreement
adjustedRandIndex(heart_clust_EM$classification,
fit.pam$clustering)
## [1] 0.1191792
# heatmap(scale(heart_clust), scale = "none",
# distfun = function(x){dist(x, method = "euclidean")},
# hclustfun = function(x){hclust(x, method =
# "ward.D2")},
# cexRow = 0.7) # does not give us clear information
Nevertheless, due to the overlapping we could say that there are between 4 and 2 tendencies of patients relating to their heart’s health, the usual patients are affected by their age, whereas the healthiest are less affected.